CAIDAS publishes first all-German large language model “LLäMmlein”

Image: Two cartoon-style sheep, one wearing a scarf, against the silhouette of a Bavarian-style town.

We are presenting a milestone for German-language large language models: the first all-German large language model has been created and trained at the Center for Artificial Intelligence and Data Science (CAIDAS) of the University of Würzburg (JMU), with computing resources provided by NHR@FAU.

With this work, JMU takes a leading role in German-language LLMs. Two new models have been trained exclusively on German data: LLäMmlein 120M and the more powerful LLäMmlein 1B with over a billion parameters. Both are released to the public today, November 15, 2024.

To date, most large language models have been trained primarily on English data sets. This is where Professor Dr. Andreas Hotho from CAIDAS and his team started their work: “With LLäMmlein we have created models that were trained exclusively on German-language data. This not only puts the focus on German language processing and opens up new possibilities for applications tailored specifically to the German language, but also supports the study of German language models.”

Publishing the dataset and several checkpoints from the training phase helps researchers study and improve the learning dynamics of the models. To monitor training progress and evaluate the final models, the team used its self-developed benchmark for German LLMs, “SuperGLEBer”, which comprises 29 tasks.

“We are presenting two models of different sizes, LLäMmlein 120M and 1B, which offer insight into how model size influences performance. In addition, we provide chat versions optimized for interactive applications,” explains Andreas Hotho.

The range of models lets developers and researchers select the one that best fits their specific needs; for instance, a preview of a Bavarian version of the language model is already available.
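
For orientation, here is a minimal sketch of how such a model could be loaded and queried with the Hugging Face transformers library. It assumes the models are published on the Hugging Face Hub; the repository name used below is illustrative and should be checked against the official release.

```python
# Minimal sketch: loading a LLäMmlein model with the Hugging Face transformers library.
# The repository id below is an assumption for illustration; check the official release page.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LSX-UniWue/LLaMmlein_1B"  # assumed Hub id; the 120M variant could be used instead

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Generate a short German continuation from a prompt.
prompt = "Die Universität Würzburg ist"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```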

This project marks the kick-off for the development of even larger models. The extensive computations were carried out on the Alex cluster of NHR@FAU; training the 1B model required about 50,000 GPU hours on A100 GPUs with 80 GB of memory and took around five weeks on 64 GPUs.
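
As a rough cross-check of these published figures, dividing the reported GPU hours by the number of GPUs gives the approximate wall-clock time, which lines up with the stated duration of around five weeks. A back-of-the-envelope sketch:

```python
# Rough cross-check of the reported training figures (uses only the numbers from this article).
gpu_hours_total = 50_000   # reported A100 GPU hours for the 1B model
num_gpus = 64              # reported number of GPUs used in parallel

wall_clock_hours = gpu_hours_total / num_gpus   # about 781 hours
wall_clock_weeks = wall_clock_hours / (24 * 7)  # about 4.7 weeks

print(f"{wall_clock_hours:.0f} hours ≈ {wall_clock_weeks:.1f} weeks")  # consistent with "around five weeks"
```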