

PolyLM: An Open Source Polyglot Large Language Model

July 12, 2023
Authors: Xiangpeng Wei, Haoran Wei, Huan Lin, Tianhao Li, Pei Zhang, Xingzhang Ren, Mei Li, Yu Wan, Zhiwei Cao, Binbin Xie, Tianxiang Hu, Shangjie Li, Binyuan Hui, Bowen Yu, Dayiheng Liu, Baosong Yang, Fei Huang, Jun Xie
cs.AI

Abstract

Large language models (LLMs) demonstrate a remarkable ability to comprehend, reason, and generate text following natural language instructions. However, the development of LLMs has focused primarily on high-resource languages, such as English, limiting their applicability and the research around them in other languages. Consequently, we present PolyLM, a multilingual LLM trained on 640 billion (B) tokens and available in two model sizes: 1.7B and 13B. To enhance its multilingual capabilities, we 1) integrate bilingual data into the training data; and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage of pre-training to 60% in the final stage. Further, we propose a multilingual self-instruct method that automatically generates 132.7K diverse multilingual instructions for model fine-tuning. To assess the model's performance, we collect several existing multilingual tasks, covering multilingual understanding, question answering, generation, and translation. Extensive experiments show that PolyLM surpasses other open-source models such as LLaMA and BLOOM on multilingual tasks while maintaining comparable performance in English. Our models, along with the instruction data and multilingual benchmark, are available at: https://modelscope.cn/models/damo/nlp_polylm_13b_text_generation.
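To make the curriculum learning strategy concrete, here is a minimal sketch of a two-stage language-mixing schedule in which the non-English share of sampled data rises from 30% to 60%, as the abstract describes. The stage boundary (halfway through the 640B tokens), the bucket names, and the sampler are illustrative assumptions, not PolyLM's actual training code.

```python
import random

def non_english_ratio(tokens_seen: int, total_tokens: int = 640_000_000_000) -> float:
    """Target non-English proportion at the current point in pre-training.

    Assumption: a simple two-stage schedule that switches halfway through
    the 640B-token budget; the paper's actual stage boundary may differ.
    """
    return 0.30 if tokens_seen < total_tokens // 2 else 0.60

def sample_language_bucket(tokens_seen: int) -> str:
    """Draw 'english' or 'non_english' according to the curriculum ratio."""
    p = non_english_ratio(tokens_seen)
    return "non_english" if random.random() < p else "english"

# Sanity check: early in training ~30% of draws are non-English; late, ~60%.
early = sum(sample_language_bucket(0) == "non_english" for _ in range(10_000)) / 10_000
late = sum(sample_language_bucket(600_000_000_000) == "non_english" for _ in range(10_000)) / 10_000
print(f"early non-English share ~ {early:.2f}, late ~ {late:.2f}")
```

In practice such a schedule would be applied per batch by the data loader, upweighting non-English shards once the boundary is crossed; the two-point step function above is the simplest schedule consistent with the 30%-to-60% figures given.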