FuxiTranyu: A Multilingual Large Language Model Trained with Balanced Data
August 12, 2024
Authors: Haoran Sun, Renren Jin, Shaoyang Xu, Leiyu Pan, Supryadi, Menglong Cui, Jiangcun Du, Yikun Lei, Lei Yang, Ling Shi, Juesi Xiao, Shaolin Zhu, Deyi Xiong
cs.AI
Abstract
Large language models (LLMs) have demonstrated prowess in a wide range of
tasks. However, many LLMs exhibit significant performance discrepancies between
high- and low-resource languages. To mitigate this challenge, we present
FuxiTranyu, an open-source multilingual LLM, which is designed to satisfy the
need of the research community for balanced and high-performing multilingual
capabilities. FuxiTranyu-8B, the base model with 8 billion parameters, is
trained from scratch on a meticulously balanced multilingual data repository
that contains 600 billion tokens covering 43 natural languages and 16
programming languages. In addition to the base model, we also develop two
instruction-tuned models: FuxiTranyu-8B-SFT that is fine-tuned on a diverse
multilingual instruction dataset, and FuxiTranyu-8B-DPO that is further refined
with DPO on a preference dataset for enhanced alignment ability. Extensive
experiments on a wide range of multilingual benchmarks demonstrate the
competitive performance of FuxiTranyu against existing multilingual LLMs, e.g.,
BLOOM-7B, PolyLM-13B, Llama-2-Chat-7B and Mistral-7B-Instruct. Interpretability
analyses at both the neuron and representation level suggest that FuxiTranyu is
able to learn consistent multilingual representations across different
languages. To promote further research into multilingual LLMs and their working
mechanisms, we release both the base and instruction-tuned FuxiTranyu models
together with 58 pretraining checkpoints at HuggingFace and Github.
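Note: the abstract states that FuxiTranyu-8B-DPO is refined with DPO (Direct Preference Optimization) on a preference dataset. As background, the snippet below is a minimal, illustrative sketch of the standard DPO objective computed from sequence log-probabilities. It is not the authors' training code; the function name, argument names, and the beta value are assumptions made here for clarity.

```python
# Illustrative sketch of the standard DPO objective (not the FuxiTranyu training code).
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss from per-example summed token log-probabilities.

    Each argument is a 1-D tensor giving the log-probability of the chosen or
    rejected response under the trainable policy or the frozen reference model.
    """
    # Log-ratios of policy to reference for preferred and dispreferred responses.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Encourage the policy to widen the margin between chosen and rejected
    # responses relative to the reference model.
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()
```

In effect, this objective increases the likelihood margin of preferred over dispreferred responses relative to a frozen reference model, which is the alignment step the abstract attributes to FuxiTranyu-8B-DPO.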