FuxiTranyu: A Multilingual Large Language Model Trained with Balanced Data
August 12, 2024
Authors: Haoran Sun, Renren Jin, Shaoyang Xu, Leiyu Pan, Supryadi, Menglong Cui, Jiangcun Du, Yikun Lei, Lei Yang, Ling Shi, Juesi Xiao, Shaolin Zhu, Deyi Xiong
cs.AI
Abstract
Large language models (LLMs) have demonstrated prowess in a wide range of
tasks. However, many LLMs exhibit significant performance discrepancies between
high- and low-resource languages. To mitigate this challenge, we present
FuxiTranyu, an open-source multilingual LLM, which is designed to satisfy the
need of the research community for balanced and high-performing multilingual
capabilities. FuxiTranyu-8B, the base model with 8 billion parameters, is
trained from scratch on a meticulously balanced multilingual data repository
that contains 600 billion tokens covering 43 natural languages and 16
programming languages. In addition to the base model, we also develop two
instruction-tuned models: FuxiTranyu-8B-SFT that is fine-tuned on a diverse
multilingual instruction dataset, and FuxiTranyu-8B-DPO that is further refined
with DPO on a preference dataset for enhanced alignment ability. Extensive
experiments on a wide range of multilingual benchmarks demonstrate the
competitive performance of FuxiTranyu against existing multilingual LLMs, e.g.,
BLOOM-7B, PolyLM-13B, Llama-2-Chat-7B and Mistral-7B-Instruct. Interpretability
analyses at both the neuron and representation levels suggest that FuxiTranyu is
able to learn consistent multilingual representations across different
languages. To promote further research into multilingual LLMs and their working
mechanisms, we release both the base and instruction-tuned FuxiTranyu models
together with 58 pretraining checkpoints at HuggingFace and Github.
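Since the base and instruction-tuned models are released on HuggingFace, a minimal sketch of loading one of them with the `transformers` library is shown below. The repository ID `TJUNLP/FuxiTranyu-8B-SFT` and the example prompt are assumptions for illustration; consult the official release page for the exact model names and recommended generation settings.

```python
# Sketch: loading a released FuxiTranyu checkpoint from the Hugging Face Hub.
# The repo ID below is hypothetical; replace it with the official one.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TJUNLP/FuxiTranyu-8B-SFT"  # hypothetical repository ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # an 8B-parameter model fits on a single large GPU in bf16
    device_map="auto",
)

# Simple multilingual prompt to exercise the instruction-tuned model.
prompt = "Translate to French: The weather is nice today."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The same pattern applies to the base model and the DPO-refined variant by swapping the repository ID, and the 58 pretraining checkpoints can be loaded analogously via their revision tags if the release exposes them that way.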