Infinity Instruct: Scaling Instruction Selection and Synthesis to Enhance Language Models

Abstract

Large Language Models (LLMs) demonstrate strong performance in real-world applications, yet existing open-source instruction datasets often concentrate on narrow domains, such as mathematics or coding, which limits generalization and widens the gap with proprietary models. To bridge this gap, we introduce Infinity-Instruct, a high-quality instruction dataset designed to enhance both the foundational and chat capabilities of LLMs through a two-phase pipeline. In Phase 1, we curate 7.4M high-quality foundational instructions (InfInstruct-F-7.4M) from over 100M samples using hybrid data selection techniques. In Phase 2, we synthesize 1.5M high-quality chat instructions (InfInstruct-G-1.5M) through a two-stage process involving instruction selection, evolution, and diagnostic filtering. We empirically evaluate Infinity-Instruct by fine-tuning several open-source models, including Mistral, LLaMA, Qwen, and Yi, and observe substantial performance gains across both foundational and instruction-following benchmarks, consistently surpassing the official instruction-tuned counterparts. Notably, InfInstruct-LLaMA3.1-70B outperforms GPT-4-0314 by 8.6% on instruction-following tasks while achieving comparable foundational performance. These results underscore the synergy between foundational and chat training and offer new insights into holistic LLM development. Our dataset (https://huggingface.co/datasets/BAAI/Infinity-Instruct) and code (https://gitee.com/li-touch/infinity-instruct) have been publicly released.
