

PolyLM: An Open Source Polyglot Large Language Model

July 12, 2023
Authors: Xiangpeng Wei, Haoran Wei, Huan Lin, Tianhao Li, Pei Zhang, Xingzhang Ren, Mei Li, Yu Wan, Zhiwei Cao, Binbin Xie, Tianxiang Hu, Shangjie Li, Binyuan Hui, Bowen Yu, Dayiheng Liu, Baosong Yang, Fei Huang, Jun Xie
cs.AI

Abstract

Large language models (LLMs) demonstrate a remarkable ability to comprehend, reason, and generate text following natural language instructions. However, the development of LLMs has been primarily focused on high-resource languages, such as English, thereby limiting their applicability and research in other languages. Consequently, we present PolyLM, a multilingual LLM trained on 640 billion (B) tokens, available in two model sizes: 1.7B and 13B. To enhance its multilingual capabilities, we 1) integrate bilingual data into the training data; and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage to 60% in the final stage of pre-training. Further, we propose a multilingual self-instruct method that automatically generates 132.7K diverse multilingual instructions for model fine-tuning. To assess the model's performance, we collect several existing multilingual tasks, covering multilingual understanding, question answering, generation, and translation. Extensive experiments show that PolyLM surpasses other open-source models such as LLaMA and BLOOM on multilingual tasks while maintaining comparable performance in English. Our models, along with the instruction data and multilingual benchmark, are available at: https://modelscope.cn/models/damo/nlp_polylm_13b_text_generation.
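
To make the curriculum learning strategy concrete, here is a minimal sketch of the kind of stage-dependent data-mixing schedule the abstract describes, in which the non-English share of sampled pre-training batches rises from 30% in the first stage to 60% in the final stage. The stage names, weight table, and sampling helper below are illustrative assumptions for exposition, not PolyLM's released training code.

```python
import random

# Hypothetical two-stage curriculum for the pre-training data mixture.
# The paper states only that the non-English proportion rises from 30%
# in the first stage to 60% in the final stage; the structure below is
# an assumed sketch of how such a schedule could be sampled.
STAGES = {
    "stage1": {"non_english": 0.30, "english": 0.70},
    "stage2": {"non_english": 0.60, "english": 0.40},
}

def sample_language_group(stage: str, rng: random.Random) -> str:
    """Pick which language group the next training batch is drawn from."""
    weights = STAGES[stage]
    groups, probs = zip(*weights.items())
    return rng.choices(groups, weights=probs, k=1)[0]

rng = random.Random(0)
# Early pre-training: ~30% of batches come from non-English data.
print(sample_language_group("stage1", rng))
# Final stage: the non-English share is raised to ~60%.
print(sample_language_group("stage2", rng))
```

A staged schedule like this lets the model first build strong representations on the higher-quality, more abundant English data before shifting capacity toward the lower-resource languages.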