FuxiTranyu: A Multilingual Large Language Model Trained with Balanced Data
August 12, 2024
Authors: Haoran Sun, Renren Jin, Shaoyang Xu, Leiyu Pan, Supryadi, Menglong Cui, Jiangcun Du, Yikun Lei, Lei Yang, Ling Shi, Juesi Xiao, Shaolin Zhu, Deyi Xiong
cs.AI
Abstract
Large language models (LLMs) have demonstrated prowess in a wide range of
tasks. However, many LLMs exhibit significant performance discrepancies between
high- and low-resource languages. To mitigate this challenge, we present
FuxiTranyu, an open-source multilingual LLM, which is designed to satisfy the
need of the research community for balanced and high-performing multilingual
capabilities. FuxiTranyu-8B, the base model with 8 billion parameters, is
trained from scratch on a meticulously balanced multilingual data repository
that contains 600 billion tokens covering 43 natural languages and 16
programming languages. In addition to the base model, we also develop two
instruction-tuned models: FuxiTranyu-8B-SFT that is fine-tuned on a diverse
multilingual instruction dataset, and FuxiTranyu-8B-DPO that is further refined
with DPO on a preference dataset for enhanced alignment ability. Extensive
experiments on a wide range of multilingual benchmarks demonstrate the
competitive performance of FuxiTranyu against existing multilingual LLMs, e.g.,
BLOOM-7B, PolyLM-13B, Llama-2-Chat-7B and Mistral-7B-Instruct. Interpretability
analyses at both the neuron and representation level suggest that FuxiTranyu is
able to learn consistent multilingual representations across different
languages. To promote further research into multilingual LLMs and their working
mechanisms, we release both the base and instruction-tuned FuxiTranyu models
together with 58 pretraining checkpoints at HuggingFace and Github.
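Note: the abstract states that FuxiTranyu-8B-DPO is refined with DPO (Direct Preference Optimization) on a preference dataset. As background, the snippet below is a minimal, illustrative sketch of the standard DPO objective computed from sequence log-probabilities. It is not the authors' training code; the function name, argument names, and the beta value are assumptions made here for clarity.

```python
# Illustrative sketch of the standard DPO objective (not the FuxiTranyu training code).
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss from per-example summed token log-probabilities.

    Each argument is a 1-D tensor giving the log-probability of the chosen or
    rejected response under the trainable policy or the frozen reference model.
    """
    # Log-ratios of policy to reference for preferred and dispreferred responses.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Encourage the policy to widen the margin between chosen and rejected
    # responses relative to the reference model.
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()
```

In effect, this objective increases the likelihood margin of preferred over dispreferred responses relative to a frozen reference model, which is the alignment step the abstract attributes to FuxiTranyu-8B-DPO.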