BayLing: Bridging Cross-lingual Alignment and Instruction Following through Interactive Translation for Large Language Models

June 19, 2023
Authors: Shaolei Zhang, Qingkai Fang, Zhuocheng Zhang, Zhengrui Ma, Yan Zhou, Langlin Huang, Mengyu Bu, Shangtong Gui, Yunji Chen, Xilin Chen, Yang Feng
cs.AI

Abstract

Large language models (LLMs) have demonstrated remarkable prowess in language understanding and generation. In advancing from foundation LLMs to instruction-following LLMs, instruction tuning plays a vital role in aligning LLMs with human preferences. However, existing LLMs usually focus on English, leading to inferior performance in non-English languages. To improve performance in non-English languages, it is necessary to collect language-specific training data for foundation LLMs and construct language-specific instructions for instruction tuning, both of which are labor-intensive. To minimize human workload, we propose to transfer the capabilities of language generation and instruction following from English to other languages through an interactive translation task. We have developed BayLing, an instruction-following LLM built on LLaMA as the foundation model, with interactive translation instructions constructed automatically for instruction tuning. Extensive assessments demonstrate that BayLing achieves performance comparable to GPT-3.5-turbo, despite its considerably smaller size of only 13 billion parameters. Experimental results on translation tasks show that BayLing attains 95% of GPT-4's single-turn translation capability under automatic evaluation and 96% of GPT-3.5-turbo's interactive translation capability under human evaluation. To estimate performance on general tasks, we created a multi-turn instruction test set called BayLing-80. The experimental results on BayLing-80 indicate that BayLing reaches 89% of GPT-3.5-turbo's performance. BayLing also demonstrates outstanding performance on knowledge assessments based on the Chinese GaoKao and the English SAT, second only to GPT-3.5-turbo among a multitude of instruction-following LLMs. The demo, homepage, code, and models of BayLing are available.
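To make the training recipe more concrete, below is a minimal sketch of how one automatically constructed multi-turn interactive translation instruction sample might look. The record schema (a `conversations` list of alternating `human`/`assistant` turns) and the refinement turn are illustrative assumptions mirroring common instruction-tuning formats, not the exact data format released with BayLing.

```python
# Hedged sketch: one interactive translation instruction record.
# The field names ("conversations", "from", "value") are assumptions,
# not BayLing's released data schema.
import json

def build_interactive_translation_sample(source, draft, feedback, revision):
    """Assemble one multi-turn sample: an initial translation request
    followed by a user refinement turn, so a single conversation
    exercises both translation and instruction following."""
    return {
        "conversations": [
            {"from": "human", "value": f"Translate into Chinese: {source}"},
            {"from": "assistant", "value": draft},
            {"from": "human", "value": feedback},    # interactive refinement
            {"from": "assistant", "value": revision},
        ]
    }

sample = build_interactive_translation_sample(
    source="Large language models have demonstrated remarkable prowess.",
    draft="大型语言模型展示了显著的能力。",
    feedback="Please make the wording more formal.",
    revision="大型语言模型已展现出卓越的能力。",
)
print(json.dumps(sample, ensure_ascii=False, indent=2))
```

The intuition behind such a format is that the cross-lingual mapping is learned from the translation turns, while the follow-up refinement turns supervise instruction following, so both capabilities transfer from English in one pass of instruction tuning.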