BayLing: Bridging Cross-lingual Alignment and Instruction Following through Interactive Translation for Large Language Models

Abstract

Large language models (LLMs) have demonstrated remarkable prowess in language understanding and generation. In advancing from foundation LLMs to instruction-following LLMs, instruction tuning plays a vital role in aligning LLMs with human preferences. However, existing LLMs usually focus on English, which leads to inferior performance in non-English languages. Improving performance for non-English languages requires collecting language-specific training data for foundation LLMs and constructing language-specific instructions for instruction tuning, both of which are labor-intensive. To minimize human workload, we propose to transfer the capabilities of language generation and instruction following from English to other languages through an interactive translation task. We have developed BayLing, an instruction-following LLM, by utilizing LLaMA as the foundation LLM and automatically constructing interactive translation instructions for instruction tuning. Extensive assessments demonstrate that BayLing achieves performance comparable to GPT-3.5-turbo, despite using a considerably smaller parameter size of only 13 billion. Experimental results on translation tasks show that BayLing achieves 95% of the single-turn translation capability of GPT-4 under automatic evaluation and 96% of the interactive translation capability of GPT-3.5-turbo under human evaluation. To estimate performance on general tasks, we created a multi-turn instruction test set called BayLing-80. The experimental results on BayLing-80 indicate that BayLing achieves 89% of the performance of GPT-3.5-turbo. BayLing also demonstrates outstanding performance on knowledge assessments based on the Chinese GaoKao and the English SAT, second only to GPT-3.5-turbo among a multitude of instruction-following LLMs. The demo, homepage, code, and models of BayLing are available.
