

PolyLM: An Open Source Polyglot Large Language Model

July 12, 2023
Authors: Xiangpeng Wei, Haoran Wei, Huan Lin, Tianhao Li, Pei Zhang, Xingzhang Ren, Mei Li, Yu Wan, Zhiwei Cao, Binbin Xie, Tianxiang Hu, Shangjie Li, Binyuan Hui, Bowen Yu, Dayiheng Liu, Baosong Yang, Fei Huang, Jun Xie
cs.AI

Abstract

Large language models (LLMs) demonstrate a remarkable ability to comprehend, reason, and generate text following natural language instructions. However, the development of LLMs has been primarily focused on high-resource languages, such as English, thereby limiting their applicability and research in other languages. Consequently, we present PolyLM, a multilingual LLM trained on 640 billion (B) tokens, available in two model sizes: 1.7B and 13B. To enhance its multilingual capabilities, we 1) integrate bilingual data into the training data; and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage to 60% in the final stage of pre-training. Further, we propose a multilingual self-instruct method that automatically generates 132.7K diverse multilingual instructions for model fine-tuning. To assess the model's performance, we collect several existing multilingual tasks, covering multilingual understanding, question answering, generation, and translation. Extensive experiments show that PolyLM surpasses other open-source models such as LLaMA and BLOOM on multilingual tasks while maintaining comparable performance in English. Our models, along with the instruction data and multilingual benchmark, are available at: https://modelscope.cn/models/damo/nlp_polylm_13b_text_generation.
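
To make the curriculum learning strategy concrete, here is a minimal sketch of the kind of stage-dependent data-mixing schedule the abstract describes, in which the non-English share of sampled pre-training batches rises from 30% in the first stage to 60% in the final stage. The stage names, weight table, and sampling helper below are illustrative assumptions for exposition, not PolyLM's released training code.

```python
import random

# Hypothetical two-stage curriculum for the pre-training data mixture.
# The paper states only that the non-English proportion rises from 30%
# in the first stage to 60% in the final stage; the structure below is
# an assumed sketch of how such a schedule could be sampled.
STAGES = {
    "stage1": {"non_english": 0.30, "english": 0.70},
    "stage2": {"non_english": 0.60, "english": 0.40},
}

def sample_language_group(stage: str, rng: random.Random) -> str:
    """Pick which language group the next training batch is drawn from."""
    weights = STAGES[stage]
    groups, probs = zip(*weights.items())
    return rng.choices(groups, weights=probs, k=1)[0]

rng = random.Random(0)
# Early pre-training: ~30% of batches come from non-English data.
print(sample_language_group("stage1", rng))
# Final stage: the non-English share is raised to ~60%.
print(sample_language_group("stage2", rng))
```

A staged schedule like this lets the model first build strong representations on the higher-quality, more abundant English data before shifting capacity toward the lower-resource languages.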