PolyLM: An Open Source Polyglot Large Language Model

Abstract

Large language models (LLMs) demonstrate a remarkable ability to comprehend, reason, and generate text following natural language instructions. However, the development of LLMs has focused primarily on high-resource languages such as English, which limits their applicability and research in other languages. Consequently, we present PolyLM, a multilingual LLM trained on 640 billion (B) tokens, available in two model sizes: 1.7B and 13B. To enhance its multilingual capabilities, we 1) integrate bilingual data into the training data; and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage to 60% in the final stage of pre-training. Further, we propose a multilingual self-instruct method that automatically generates 132.7K diverse multilingual instructions for model fine-tuning. To assess the model's performance, we collect several existing multilingual tasks, including multilingual understanding, question answering, generation, and translation. Extensive experiments show that PolyLM surpasses other open-source models such as LLaMA and BLOOM on multilingual tasks while maintaining comparable performance in English. Our models, along with the instruction data and multilingual benchmark, are available at: https://modelscope.cn/models/damo/nlp_polylm_13b_text_generation.
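As an illustration only, the two-stage curriculum described in the abstract can be sketched as a sampling schedule. The 30% and 60% non-English shares come from the abstract; the stage boundary (`first_stage_fraction`), the function names, and the per-example random draw are assumptions made for this sketch and are not details stated in the abstract.

```python
import random

# Non-English share of the training mix per stage (values from the abstract).
NON_ENGLISH_SHARE = {"first": 0.30, "final": 0.60}


def stage_for_step(step: int, total_steps: int,
                   first_stage_fraction: float = 0.8) -> str:
    """Map a training step to a curriculum stage.

    The boundary (first_stage_fraction) is hypothetical: the abstract
    does not say where the switch between stages happens.
    """
    return "first" if step < first_stage_fraction * total_steps else "final"


def sample_is_non_english(step: int, total_steps: int,
                          rng: random.Random) -> bool:
    """Decide whether the next training example is drawn from non-English data."""
    share = NON_ENGLISH_SHARE[stage_for_step(step, total_steps)]
    return rng.random() < share


def empirical_share(start: int, end: int, total_steps: int,
                    seed: int = 0) -> float:
    """Fraction of non-English draws over the steps in [start, end)."""
    rng = random.Random(seed)
    hits = sum(sample_is_non_english(s, total_steps, rng)
               for s in range(start, end))
    return hits / (end - start)
```

Under these assumptions, `empirical_share(0, 80_000, 100_000)` should come out close to 0.30 and `empirical_share(80_000, 100_000, 100_000)` close to 0.60. A real data pipeline would more likely reweight whole corpora than draw per example.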
