BigTrans: Augmenting Large Language Models with Multilingual Translation Capability over 100 Languages

Abstract

Large language models (LLMs) show promising translation performance across many natural languages. However, many LLMs, especially open-source ones such as BLOOM and LLaMA, are English-dominant and support only dozens of natural languages, so their potential for translation remains underexplored. In this work, we present BigTrans, which adapts LLaMA, a model that covers only 20 languages, and augments it with multilingual translation capability for more than 100 languages. BigTrans is built on LLaMA-13B and optimized in three steps. First, we continue training LLaMA on massive Chinese monolingual data. Second, we continue training the model on a large-scale parallel dataset covering 102 natural languages. Third, we instruction-tune the resulting foundation model with multilingual translation instructions, which yields our BigTrans model. Preliminary experiments on multilingual translation show that BigTrans performs comparably with ChatGPT and Google Translate in many languages and even outperforms ChatGPT in 8 language pairs. We release the BigTrans model and hope it will advance research in this area.
