FuxiTranyu: A Multilingual Large Language Model Trained with Balanced Data
August 12, 2024
Authors: Haoran Sun, Renren Jin, Shaoyang Xu, Leiyu Pan, Supryadi, Menglong Cui, Jiangcun Du, Yikun Lei, Lei Yang, Ling Shi, Juesi Xiao, Shaolin Zhu, Deyi Xiong
cs.AI
Abstract
Large language models (LLMs) have demonstrated prowess in a wide range of
tasks. However, many LLMs exhibit significant performance discrepancies between
high- and low-resource languages. To mitigate this challenge, we present
FuxiTranyu, an open-source multilingual LLM, which is designed to satisfy the
need of the research community for balanced and high-performing multilingual
capabilities. FuxiTranyu-8B, the base model with 8 billion parameters, is
trained from scratch on a meticulously balanced multilingual data repository
that contains 600 billion tokens covering 43 natural languages and 16
programming languages. In addition to the base model, we also develop two
instruction-tuned models: FuxiTranyu-8B-SFT that is fine-tuned on a diverse
multilingual instruction dataset, and FuxiTranyu-8B-DPO that is further refined
with DPO on a preference dataset for enhanced alignment ability. Extensive
experiments on a wide range of multilingual benchmarks demonstrate the
competitive performance of FuxiTranyu against existing multilingual LLMs, e.g.,
BLOOM-7B, PolyLM-13B, Llama-2-Chat-7B and Mistral-7B-Instruct. Interpretability
analyses at both the neuron and representation levels suggest that FuxiTranyu is
able to learn consistent multilingual representations across different
languages. To promote further research into multilingual LLMs and their working
mechanisms, we release both the base and instruction-tuned FuxiTranyu models
together with 58 pretraining checkpoints at HuggingFace and Github.
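Since the base and instruction-tuned models are released on HuggingFace, a minimal sketch of loading one of them with the `transformers` library is shown below. The repository ID `TJUNLP/FuxiTranyu-8B-SFT` and the example prompt are assumptions for illustration; consult the official release page for the exact model names and recommended generation settings.

```python
# Sketch: loading a released FuxiTranyu checkpoint from the Hugging Face Hub.
# The repo ID below is hypothetical; replace it with the official one.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TJUNLP/FuxiTranyu-8B-SFT"  # hypothetical repository ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # an 8B-parameter model fits on a single large GPU in bf16
    device_map="auto",
)

# Simple multilingual prompt to exercise the instruction-tuned model.
prompt = "Translate to French: The weather is nice today."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The same pattern applies to the base model and the DPO-refined variant by swapping the repository ID, and the 58 pretraining checkpoints can be loaded analogously via their revision tags if the release exposes them that way.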