Infinity Instruct: Scaling Instruction Selection and Synthesis to Enhance Language Models
June 9, 2025
Authors: Jijie Li, Li Du, Hanyu Zhao, Bo-wen Zhang, Liangdong Wang, Boyan Gao, Guang Liu, Yonghua Lin
cs.AI
Abstract
Large Language Models (LLMs) demonstrate strong performance in real-world
applications, yet existing open-source instruction datasets often concentrate
on narrow domains, such as mathematics or coding, limiting generalization and
widening the gap with proprietary models. To bridge this gap, we introduce
Infinity-Instruct, a high-quality instruction dataset designed to enhance both
foundational and chat capabilities of LLMs through a two-phase pipeline. In
Phase 1, we curate 7.4M high-quality foundational instructions
(InfInstruct-F-7.4M) from over 100M samples using hybrid data selection
techniques. In Phase 2, we synthesize 1.5M high-quality chat instructions
(InfInstruct-G-1.5M) through a two-stage process involving instruction
selection, evolution, and diagnostic filtering. We empirically evaluate
Infinity-Instruct by fine-tuning several open-source models, including Mistral,
LLaMA, Qwen, and Yi, and observe substantial performance gains across both
foundational and instruction-following benchmarks, consistently surpassing
official instruction-tuned counterparts. Notably, InfInstruct-LLaMA3.1-70B
outperforms GPT-4-0314 by 8.6% on instruction-following tasks while achieving
comparable foundational performance. These results underscore the synergy
between foundational and chat training and offer new insights into holistic LLM
development. Our dataset (https://huggingface.co/datasets/BAAI/Infinity-Instruct)
and code (https://gitee.com/li-touch/infinity-instruct) have been publicly
released.
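
The two-phase pipeline described above can be pictured with a minimal sketch. The Python below is only an illustration under stated assumptions: the Sample fields, the quality threshold, and the evolve/diagnose callables are hypothetical stand-ins and do not reflect the paper's actual implementation of hybrid data selection, instruction evolution, or diagnostic filtering.

from dataclasses import dataclass
from typing import Callable, Iterable, List

@dataclass
class Sample:
    instruction: str
    response: str
    quality: float  # score from a hypothetical quality model (assumption)

def phase1_select(pool: Iterable[Sample], threshold: float = 0.9) -> List[Sample]:
    """Phase 1 (sketch): keep foundational samples whose quality score clears a
    threshold, standing in for the paper's hybrid selection over ~100M samples."""
    return [s for s in pool if s.quality >= threshold]

def phase2_synthesize(
    seeds: List[Sample],
    evolve: Callable[[str], str],       # rewrites an instruction into a harder variant
    diagnose: Callable[[Sample], bool],  # accepts or rejects an evolved sample
) -> List[Sample]:
    """Phase 2 (sketch): evolve selected chat instructions, then apply a
    diagnostic filter to drop low-quality evolutions."""
    evolved = [Sample(evolve(s.instruction), s.response, s.quality) for s in seeds]
    return [s for s in evolved if diagnose(s)]

In this sketch, phase1_select corresponds to producing InfInstruct-F-7.4M from the raw pool, and phase2_synthesize to producing InfInstruct-G-1.5M; the concrete selection models, evolution prompts, and diagnostic criteria are described in the paper, not here.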