FuxiTranyu: A Multilingual Large Language Model Trained with Balanced Data

Abstract

Large language models (LLMs) have demonstrated strong capabilities across a wide range of tasks. However, many LLMs exhibit significant performance gaps between high- and low-resource languages. To mitigate this problem, we present FuxiTranyu, an open-source multilingual LLM designed to meet the research community's need for balanced, high-performing multilingual capabilities. FuxiTranyu-8B, the base model with 8 billion parameters, is trained from scratch on a meticulously balanced multilingual data repository that contains 600 billion tokens covering 43 natural languages and 16 programming languages. In addition to the base model, we develop two instruction-tuned models: FuxiTranyu-8B-SFT, which is fine-tuned on a diverse multilingual instruction dataset, and FuxiTranyu-8B-DPO, which is further refined with Direct Preference Optimization (DPO) on a preference dataset to strengthen alignment. Extensive experiments on a wide range of multilingual benchmarks demonstrate that FuxiTranyu performs competitively against existing multilingual LLMs, e.g., BLOOM-7B, PolyLM-13B, Llama-2-Chat-7B, and Mistral-7B-Instruct. Interpretability analyses at both the neuron and representation levels suggest that FuxiTranyu learns consistent multilingual representations across different languages. To promote further research into multilingual LLMs and their working mechanisms, we release both the base and instruction-tuned FuxiTranyu models, together with 58 pretraining checkpoints, on HuggingFace and GitHub.
