FuxiTranyu: A Multilingual Large Language Model Trained with Balanced Data

Abstract

Large language models (LLMs) have demonstrated strong capabilities across a wide range of tasks. However, many LLMs exhibit significant performance gaps between high- and low-resource languages. To mitigate this gap, we present FuxiTranyu, an open-source multilingual LLM designed to meet the research community's need for balanced, high-performing multilingual capabilities. FuxiTranyu-8B, the base model with 8 billion parameters, is trained from scratch on a meticulously balanced multilingual data repository containing 600 billion tokens that cover 43 natural languages and 16 programming languages. In addition to the base model, we develop two instruction-tuned models: FuxiTranyu-8B-SFT, which is fine-tuned on a diverse multilingual instruction dataset, and FuxiTranyu-8B-DPO, which is further refined with direct preference optimization (DPO) on a preference dataset for enhanced alignment. Extensive experiments on a wide range of multilingual benchmarks demonstrate that FuxiTranyu performs competitively against existing multilingual LLMs, e.g., BLOOM-7B, PolyLM-13B, Llama-2-Chat-7B, and Mistral-7B-Instruct. Interpretability analyses at both the neuron and representation levels suggest that FuxiTranyu learns consistent multilingual representations across languages. To promote further research into multilingual LLMs and their working mechanisms, we release the base and instruction-tuned FuxiTranyu models, together with 58 pretraining checkpoints, on HuggingFace and GitHub.
