BayLing: Bridging Cross-lingual Alignment and Instruction Following through Interactive Translation for Large Language Models
June 19, 2023
作者: Shaolei Zhang, Qingkai Fang, Zhuocheng Zhang, Zhengrui Ma, Yan Zhou, Langlin Huang, Mengyu Bu, Shangtong Gui, Yunji Chen, Xilin Chen, Yang Feng
cs.AI
Abstract
Large language models (LLMs) have demonstrated remarkable prowess in language
understanding and generation. Advancing from foundation LLMs to
instruction-following LLMs, instruction tuning plays a vital role in aligning
LLMs with human preferences. However, existing LLMs usually focus on
English, leading to inferior performance in non-English languages. In order to
improve the performance for non-English languages, it is necessary to collect
language-specific training data for foundation LLMs and construct
language-specific instructions for instruction tuning, both of which are
labor-intensive. To minimize human workload, we propose to transfer the
capabilities of
language generation and instruction following from English to other languages
through an interactive translation task. We have developed BayLing, an
instruction-following LLM built by utilizing LLaMA as the foundation LLM and
automatically constructing interactive translation instructions for
instruction tuning. Extensive assessments demonstrate that BayLing achieves comparable
performance to GPT-3.5-turbo, despite utilizing a considerably smaller
parameter size of only 13 billion. Experimental results on translation tasks
show that BayLing achieves 95% of the single-turn translation capability of
GPT-4 under automatic evaluation and 96% of the interactive translation
capability of GPT-3.5-turbo under human evaluation. To estimate the
performance on general tasks, we created a multi-turn instruction test set
called BayLing-80. The experimental results on BayLing-80 indicate that BayLing
achieves 89% of the performance of GPT-3.5-turbo. BayLing also
demonstrates outstanding performance on knowledge assessments of the Chinese
GaoKao and English SAT, second only to GPT-3.5-turbo among a multitude of
instruction-following LLMs. Demo, homepage, code and models of BayLing are
available.
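
To give a concrete picture of what an "interactive translation instruction" could look like, the sketch below shows a hypothetical multi-turn training record and a small helper that flattens it into a supervised fine-tuning string. The field names, role tags, and prompt template are illustrative assumptions for this sketch, not BayLing's actual data schema or training format.

```python
# Hypothetical sketch of a multi-turn "interactive translation" training record,
# in the conversation style commonly used for instruction tuning.
# Field names and role tags are illustrative, not BayLing's actual schema.

example = {
    "id": "interactive-translation-0001",
    "conversations": [
        {"role": "user",
         "content": "Translate to Chinese: The weather is lovely today."},
        {"role": "assistant",
         "content": "今天天气很好。"},
        # Follow-up turn: the user refines the earlier translation interactively.
        {"role": "user",
         "content": "Make the translation more formal."},
        {"role": "assistant",
         "content": "今日天气宜人。"},
    ],
}


def to_training_text(record, sep="\n"):
    """Flatten a multi-turn conversation into one supervised training string.

    This mirrors the common practice of concatenating user/assistant turns with
    role tags; the exact template used by BayLing is not specified here.
    """
    parts = []
    for turn in record["conversations"]:
        tag = "Human" if turn["role"] == "user" else "Assistant"
        parts.append(f"{tag}: {turn['content']}")
    return sep.join(parts)


if __name__ == "__main__":
    print(to_training_text(example))
```

The intuition behind such records is that each follow-up turn ties instruction following (revise, reformat, adjust style) to cross-lingual generation in the same dialogue, which is how the interactive translation task is meant to transfer both capabilities from English to other languages.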