BigTrans: Augmenting Large Language Models with Multilingual Translation Capability over 100 Languages

Abstract

Large language models (LLMs) demonstrate promising translation performance across a variety of natural languages. However, many LLMs, especially open-source ones such as BLOOM and LLaMA, are English-dominant and support only dozens of natural languages, leaving the potential of LLMs for translation underexplored. In this work, we present BigTrans, which adapts LLaMA, a model that covers only 20 languages, and enhances it with multilingual translation capability for more than 100 languages. BigTrans is built on LLaMA-13B and optimized in three steps. First, we continue training LLaMA on massive Chinese monolingual data. Second, we continue training the model on a large-scale parallel dataset covering 102 natural languages. Third, we instruction-tune the foundation model with multilingual translation instructions, yielding the BigTrans model. Preliminary experiments on multilingual translation show that BigTrans performs comparably to ChatGPT and Google Translate in many languages and even outperforms ChatGPT in 8 language pairs. We release the BigTrans model in the hope that it will advance research progress.
