YAYI 2: Multilingual Open-Source Large Language Models

Abstract

As the latest advance in natural language processing, large language models (LLMs) have achieved human-level language understanding and generation abilities in many real-world tasks, and have even been regarded as a potential path to artificial general intelligence. To better facilitate research on LLMs, many open-source LLMs, such as Llama 2 and Falcon, have recently been proposed and achieve performance comparable to that of proprietary models. However, these models are primarily designed for English scenarios and perform poorly in Chinese contexts. In this technical report, we propose YAYI 2, including both base and chat models, with 30 billion parameters. YAYI 2 is pre-trained from scratch on a multilingual corpus containing 2.65 trillion tokens filtered by our pre-training data processing pipeline. The base model is aligned with human values through supervised fine-tuning on millions of instructions and reinforcement learning from human feedback. Extensive experiments on multiple benchmarks, such as MMLU and CMMLU, consistently demonstrate that the proposed YAYI 2 outperforms other similarly sized open-source models.

