Babel: Open Multilingual Large Language Models Covering Over 90% of the World's Population

Abstract

Large language models (LLMs) have revolutionized natural language processing (NLP), yet open-source multilingual LLMs remain scarce, and existing models often have limited language coverage. Such models typically prioritize well-resourced languages, while widely spoken but under-resourced languages are often overlooked. To address this disparity, we introduce Babel, an open multilingual LLM that covers the top 25 languages by number of speakers, supports over 90% of the global population, and includes many languages neglected by other open multilingual LLMs. Unlike traditional continued-pretraining approaches, Babel expands its parameter count through a layer extension technique, which raises its performance ceiling. We introduce two variants: Babel-9B, designed for efficient inference and fine-tuning, and Babel-83B, which sets a new standard for open multilingual LLMs. Extensive evaluations on multilingual tasks demonstrate Babel's superior performance compared to open LLMs of comparable size. In addition, when fine-tuned on open-source supervised fine-tuning datasets, Babel achieves remarkable performance: Babel-9B-Chat leads among 10B-scale LLMs, and Babel-83B-Chat sets a new standard for multilingual tasks, reaching a level comparable to commercial models.