Llama-GENBA-10B: A Trilingual Large Language Model for German, English and Bavarian
September 6, 2025
Authors: Michael Hoffmann, Jophin John, Stefan Schweter, Gokul Ramakrishnan, Hoi-Fong Mak, Alice Zhang, Dmitry Gaynullin, Nicolay J. Hammer
cs.AI
Abstract
We present Llama-GENBA-10B, a trilingual foundation model addressing
English-centric bias in large language models. Built on Llama 3.1-8B and scaled
to 10B parameters, Llama-GENBA-10B is continuously pretrained on 164B tokens
(82B English, 82B German, and 80M Bavarian), balancing resources while
preventing English dominance. Targeted at the German NLP community, the model
also promotes Bavarian as a low-resource language. Development tackled four
challenges: (1) curating a multilingual corpus despite Bavarian scarcity, (2)
creating a unified tokenizer for English, German, and Bavarian, (3) optimizing
architecture and language-ratio hyperparameters for cross-lingual transfer, and
(4) establishing the first standardized trilingual evaluation suite by
translating German benchmarks into Bavarian. Evaluations show that
Llama-GENBA-10B achieves strong cross-lingual performance, with the fine-tuned
variant surpassing Apertus-8B-2509 and gemma-2-9b in Bavarian and establishing
itself as the best model in its class for this language, while also
outperforming EuroLLM in English and matching its results in German. Training
on the Cerebras CS-2 demonstrated efficient large-scale multilingual
pretraining with documented energy use, offering a blueprint for inclusive
foundation models that integrate low-resource languages.