Llama-GENBA-10B: A Trilingual Large Language Model for German, English and Bavarian
September 6, 2025
Authors: Michael Hoffmann, Jophin John, Stefan Schweter, Gokul Ramakrishnan, Hoi-Fong Mak, Alice Zhang, Dmitry Gaynullin, Nicolay J. Hammer
cs.AI
Abstract
We present Llama-GENBA-10B, a trilingual foundation model addressing
English-centric bias in large language models. Built on Llama 3.1-8B and scaled
to 10B parameters, Llama-GENBA-10B is continuously pretrained on 164B tokens
(82B English, 82B German, and 80M Bavarian), balancing resources while
preventing English dominance. Targeted at the German NLP community, the model
also promotes Bavarian as a low-resource language. Development tackled four
challenges: (1) curating a multilingual corpus despite Bavarian scarcity, (2)
creating a unified tokenizer for English, German, and Bavarian, (3) optimizing
architecture and language-ratio hyperparameters for cross-lingual transfer, and
(4) establishing the first standardized trilingual evaluation suite by
translating German benchmarks into Bavarian. Evaluations show that
Llama-GENBA-10B achieves strong cross-lingual performance, with the fine-tuned
variant surpassing Apertus-8B-2509 and gemma-2-9b in Bavarian and establishing
itself as the best model in its class for this language, while also
outperforming EuroLLM in English and matching its results in German. Training
on the Cerebras CS-2 demonstrated efficient large-scale multilingual
pretraining with documented energy use, offering a blueprint for inclusive
foundation models that integrate low-resource languages.
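The abstract does not say how the Llama 3.1-8B base was grown to 10B parameters. One common recipe for this kind of scaling is depth up-scaling, i.e. duplicating a contiguous span of decoder layers before continued pretraining. The toy sketch below (plain PyTorch with a hypothetical `ToyDecoderLayer` stand-in, not the paper's actual procedure) illustrates the idea:

```python
import copy
import torch.nn as nn

class ToyDecoderLayer(nn.Module):
    """Stand-in for a transformer decoder block (hypothetical)."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.ffn = nn.Linear(dim, dim)

    def forward(self, x):
        return x + self.ffn(x)

def depth_upscale(layers: nn.ModuleList, start: int, end: int) -> nn.ModuleList:
    """Return a deeper stack with copies of layers[start:end] spliced in."""
    duplicated = [copy.deepcopy(layer) for layer in layers[start:end]]
    return nn.ModuleList(list(layers[:end]) + duplicated + list(layers[end:]))

base = nn.ModuleList(ToyDecoderLayer() for _ in range(32))  # 8B-like depth
grown = depth_upscale(base, start=24, end=32)               # duplicate the top 8 blocks
print(f"{len(base)} layers -> {len(grown)} layers")         # 32 -> 40
```

After such a splice, continued pretraining (here, on the 164B-token trilingual mix) lets the duplicated layers differentiate from their sources.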
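The 82B/82B/80M token split implies a sampling policy that balances English and German while still exhausting the far smaller Bavarian corpus. A minimal sketch follows, assuming (hypothetically; the abstract does not specify the actual data scheduler) that each document's language is drawn with probability proportional to that language's remaining token budget:

```python
import random

# Token budgets from the abstract: 82B English, 82B German, 80M Bavarian,
# scaled down by 10^7 here so the toy loop finishes instantly.
remaining = {"en": 8_200, "de": 8_200, "bar": 8}
served = {lang: 0 for lang in remaining}

def next_language(rng: random.Random) -> str:
    """Pick the next document's language, weighted by remaining budget."""
    langs = [l for l, r in remaining.items() if r > 0]
    return rng.choices(langs, weights=[remaining[l] for l in langs], k=1)[0]

rng = random.Random(0)
while any(r > 0 for r in remaining.values()):
    lang = next_language(rng)
    doc_tokens = min(remaining[lang], 100)  # pretend every document has ~100 tokens
    remaining[lang] -= doc_tokens
    served[lang] += doc_tokens

print(served)  # proportions mirror the 82B : 82B : 0.08B mix
```

Budget-proportional sampling keeps English and German interleaved roughly 1:1 throughout training instead of consuming one language first, which is one way to realize the "preventing English dominance" goal stated in the abstract.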
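For the unified tokenizer, one plausible construction (not confirmed by the abstract) is training a single byte-level BPE vocabulary on a balanced trilingual sample, e.g. with the Hugging Face `tokenizers` library. In the sketch below, the corpus lines are illustrative stand-ins and the special tokens merely follow Llama 3 naming conventions:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Stand-in corpus; a real run would stream balanced English/German/Bavarian text.
corpus = [
    "The quick brown fox jumps over the lazy dog.",             # English
    "Der schnelle braune Fuchs springt über den faulen Hund.",  # German
    "Da schnelle braune Fuchs hupft übern fauln Hund.",         # Bavarian (illustrative)
]

# Byte-level BPE keeps every UTF-8 string representable, which matters for
# umlauts and non-standardized dialect spellings.
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)

trainer = trainers.BpeTrainer(
    vocab_size=1_000,                                         # toy size
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),     # full byte coverage
    special_tokens=["<|begin_of_text|>", "<|end_of_text|>"],  # Llama-3-style markers
)
tokenizer.train_from_iterator(corpus, trainer=trainer)

for text in corpus:
    print(tokenizer.encode(text).tokens)
```

Training on all three languages at once gives German and Bavarian words whole-word merges instead of the character-level fragmentation an English-centric vocabulary would produce, which is the usual motivation for a shared tokenizer in this setting.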