

PolyLM: An Open Source Polyglot Large Language Model

July 12, 2023
Authors: Xiangpeng Wei, Haoran Wei, Huan Lin, Tianhao Li, Pei Zhang, Xingzhang Ren, Mei Li, Yu Wan, Zhiwei Cao, Binbin Xie, Tianxiang Hu, Shangjie Li, Binyuan Hui, Bowen Yu, Dayiheng Liu, Baosong Yang, Fei Huang, Jun Xie
cs.AI

Abstract

Large language models (LLMs) demonstrate a remarkable ability to comprehend, reason, and generate text following natural language instructions. However, the development of LLMs has focused primarily on high-resource languages, such as English, limiting their applicability and the research around them in other languages. Consequently, we present PolyLM, a multilingual LLM trained on 640 billion (B) tokens and available in two model sizes: 1.7B and 13B. To enhance its multilingual capabilities, we 1) integrate bilingual data into the training data; and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage of pre-training to 60% in the final stage. Further, we propose a multilingual self-instruct method that automatically generates 132.7K diverse multilingual instructions for model fine-tuning. To assess the model's performance, we collect several existing multilingual tasks, covering multilingual understanding, question answering, generation, and translation. Extensive experiments show that PolyLM surpasses other open-source models such as LLaMA and BLOOM on multilingual tasks while maintaining comparable performance in English. Our models, along with the instruction data and multilingual benchmark, are available at: https://modelscope.cn/models/damo/nlp_polylm_13b_text_generation.
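To make the curriculum learning strategy concrete, here is a minimal sketch of a two-stage language-mixing schedule in which the non-English share of sampled data rises from 30% to 60%, as the abstract describes. The stage boundary (halfway through the 640B tokens), the bucket names, and the sampler are illustrative assumptions, not PolyLM's actual training code.

```python
import random

def non_english_ratio(tokens_seen: int, total_tokens: int = 640_000_000_000) -> float:
    """Target non-English proportion at the current point in pre-training.

    Assumption: a simple two-stage schedule that switches halfway through
    the 640B-token budget; the paper's actual stage boundary may differ.
    """
    return 0.30 if tokens_seen < total_tokens // 2 else 0.60

def sample_language_bucket(tokens_seen: int) -> str:
    """Draw 'english' or 'non_english' according to the curriculum ratio."""
    p = non_english_ratio(tokens_seen)
    return "non_english" if random.random() < p else "english"

# Sanity check: early in training ~30% of draws are non-English; late, ~60%.
early = sum(sample_language_bucket(0) == "non_english" for _ in range(10_000)) / 10_000
late = sum(sample_language_bucket(600_000_000_000) == "non_english" for _ in range(10_000)) / 10_000
print(f"early non-English share ~ {early:.2f}, late ~ {late:.2f}")
```

In practice such a schedule would be applied per batch by the data loader, upweighting non-English shards once the boundary is crossed; the two-point step function above is the simplest schedule consistent with the 30%-to-60% figures given.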