

Llama-GENBA-10B: A Trilingual Large Language Model for German, English and Bavarian

September 6, 2025
作者: Michael Hoffmann, Jophin John, Stefan Schweter, Gokul Ramakrishnan, Hoi-Fong Mak, Alice Zhang, Dmitry Gaynullin, Nicolay J. Hammer
cs.AI

Abstract

We present Llama-GENBA-10B, a trilingual foundation model addressing English-centric bias in large language models. Built on Llama 3.1-8B and scaled to 10B parameters, Llama-GENBA-10B is continuously pretrained on 164B tokens (82B English, 82B German, and 80M Bavarian), balancing resources while preventing English dominance. Targeted at the German NLP community, the model also promotes Bavarian as a low-resource language. Development tackled four challenges: (1) curating a multilingual corpus despite Bavarian scarcity, (2) creating a unified tokenizer for English, German, and Bavarian, (3) optimizing architecture and language-ratio hyperparameters for cross-lingual transfer, and (4) establishing the first standardized trilingual evaluation suite by translating German benchmarks into Bavarian. Evaluations show that Llama-GENBA-10B achieves strong cross-lingual performance, with the fine-tuned variant surpassing Apertus-8B-2509 and gemma-2-9b in Bavarian and establishing itself as the best model in its class for this language, while also outperforming EuroLLM in English and matching its results in German. Training on the Cerebras CS-2 demonstrated efficient large-scale multilingual pretraining with documented energy use, offering a blueprint for inclusive foundation models that integrate low-resource languages.