BayLing: Bridging Cross-lingual Alignment and Instruction Following through Interactive Translation for Large Language Models

Abstract

Large language models (LLMs) have demonstrated remarkable prowess in language understanding and generation. As foundation LLMs advance into instruction-following LLMs, instruction tuning plays a vital role in aligning them with human preferences. However, existing LLMs usually focus on English, which leads to inferior performance in non-English languages. Improving performance in non-English languages requires collecting language-specific training data for foundation LLMs and constructing language-specific instructions for instruction tuning, both of which are labor-intensive. To minimize human workload, we propose to transfer the capabilities of language generation and instruction following from English to other languages through an interactive translation task. We have developed BayLing, an instruction-following LLM, by using LLaMA as the foundation LLM and automatically constructing interactive translation instructions for instruction tuning. Extensive assessments demonstrate that BayLing achieves performance comparable to GPT-3.5-turbo despite a considerably smaller size of only 13 billion parameters. Experimental results on translation tasks show that BayLing achieves 95% of the single-turn translation capability of GPT-4 under automatic evaluation and 96% of the interactive translation capability of GPT-3.5-turbo under human evaluation. To estimate performance on general tasks, we created a multi-turn instruction test set called BayLing-80. Experimental results on BayLing-80 indicate that BayLing achieves 89% of the performance of GPT-3.5-turbo. BayLing also demonstrates outstanding performance on knowledge assessments based on the Chinese GaoKao and the English SAT, second only to GPT-3.5-turbo among a multitude of instruction-following LLMs. The demo, homepage, code, and models of BayLing are available.
