

Infinity Instruct: Scaling Instruction Selection and Synthesis to Enhance Language Models

June 9, 2025
Authors: Jijie Li, Li Du, Hanyu Zhao, Bo-wen Zhang, Liangdong Wang, Boyan Gao, Guang Liu, Yonghua Lin
cs.AI

Abstract

Large Language Models (LLMs) demonstrate strong performance in real-world applications, yet existing open-source instruction datasets often concentrate on narrow domains, such as mathematics or coding, limiting generalization and widening the gap with proprietary models. To bridge this gap, we introduce Infinity-Instruct, a high-quality instruction dataset designed to enhance both foundational and chat capabilities of LLMs through a two-phase pipeline. In Phase 1, we curate 7.4M high-quality foundational instructions (InfInstruct-F-7.4M) from over 100M samples using hybrid data selection techniques. In Phase 2, we synthesize 1.5M high-quality chat instructions (InfInstruct-G-1.5M) through a two-stage process involving instruction selection, evolution, and diagnostic filtering. We empirically evaluate Infinity-Instruct by fine-tuning several open-source models, including Mistral, LLaMA, Qwen, and Yi, and observe substantial performance gains across both foundational and instruction-following benchmarks, consistently surpassing official instruction-tuned counterparts. Notably, InfInstruct-LLaMA3.1-70B outperforms GPT-4-0314 by 8.6% on instruction-following tasks while achieving comparable foundational performance. These results underscore the synergy between foundational and chat training and offer new insights into holistic LLM development. Our dataset https://huggingface.co/datasets/BAAI/Infinity-Instruct and code https://gitee.com/li-touch/infinity-instruct have been publicly released.
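To make the two-phase pipeline concrete, here is a minimal Python sketch of the curation flow the abstract describes: large-pool quality selection in Phase 1, then instruction evolution and diagnostic filtering in Phase 2. Every name and heuristic below (quality_score, evolve_instruction, diagnose) is a hypothetical illustration of the general technique, not the authors' released code or their actual selection models.

```python
# Hypothetical sketch of the two-phase curation flow from the abstract.
# Phase 1: score a large instruction pool and keep the best fraction.
# Phase 2: evolve selected instructions, then apply a diagnostic filter.
from dataclasses import dataclass


@dataclass
class Sample:
    instruction: str
    response: str
    score: float = 0.0  # quality score assigned during Phase 1 selection


def quality_score(sample: Sample) -> float:
    """Stand-in for a hybrid selector (e.g., rule-based filters combined
    with a learned quality model); here, a trivial length heuristic."""
    return min(len(sample.response) / 512.0, 1.0)


def phase1_select(pool: list[Sample], keep_ratio: float) -> list[Sample]:
    """Phase 1: rank the pool by quality and keep the top fraction as
    foundational instructions (InfInstruct-F in the paper)."""
    for s in pool:
        s.score = quality_score(s)
    ranked = sorted(pool, key=lambda s: s.score, reverse=True)
    return ranked[: max(1, int(len(ranked) * keep_ratio))]


def evolve_instruction(sample: Sample) -> Sample:
    """Stand-in for instruction evolution (e.g., an LLM rewriting the
    instruction to be more complex or diverse); here we only tag it."""
    return Sample(sample.instruction + " (evolved)", sample.response)


def diagnose(sample: Sample) -> bool:
    """Stand-in for diagnostic filtering: drop degenerate pairs."""
    return bool(sample.instruction.strip()) and bool(sample.response.strip())


def phase2_synthesize(seeds: list[Sample]) -> list[Sample]:
    """Phase 2: evolve selected seeds, then keep only samples that pass
    the diagnostic filter (InfInstruct-G in the paper)."""
    evolved = [evolve_instruction(s) for s in seeds]
    return [s for s in evolved if diagnose(s)]


if __name__ == "__main__":
    pool = [Sample(f"question {i}", "answer " * (i * 10)) for i in range(1, 11)]
    foundational = phase1_select(pool, keep_ratio=0.5)
    chat = phase2_synthesize(foundational[:3])
    print(f"foundational: {len(foundational)}, chat: {len(chat)}")
```

At the paper's scale, the real selection step reduces over 100M candidates to 7.4M, so the scoring stage would be batched and model-based rather than a single in-memory sort as sketched here.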