Infinity Instruct: Scaling Instruction Selection and Synthesis to Enhance Language Models
June 9, 2025
Authors: Jijie Li, Li Du, Hanyu Zhao, Bo-wen Zhang, Liangdong Wang, Boyan Gao, Guang Liu, Yonghua Lin
cs.AI
Abstract
Large Language Models (LLMs) demonstrate strong performance in real-world
applications, yet existing open-source instruction datasets often concentrate
on narrow domains, such as mathematics or coding, limiting generalization and
widening the gap with proprietary models. To bridge this gap, we introduce
Infinity-Instruct, a high-quality instruction dataset designed to enhance both
foundational and chat capabilities of LLMs through a two-phase pipeline. In
Phase 1, we curate 7.4M high-quality foundational instructions
(InfInstruct-F-7.4M) from over 100M samples using hybrid data selection
techniques. In Phase 2, we synthesize 1.5M high-quality chat instructions
(InfInstruct-G-1.5M) through a two-stage process involving instruction
selection, evolution, and diagnostic filtering. We empirically evaluate
Infinity-Instruct by fine-tuning several open-source models, including Mistral,
LLaMA, Qwen, and Yi, and observe substantial performance gains across both
foundational and instruction-following benchmarks, consistently surpassing
official instruction-tuned counterparts. Notably, InfInstruct-LLaMA3.1-70B
outperforms GPT-4-0314 by 8.6% on instruction-following tasks while achieving
comparable foundational performance. These results underscore the synergy
between foundational and chat training and offer new insights into holistic LLM
development. Our dataset (https://huggingface.co/datasets/BAAI/Infinity-Instruct)
and code (https://gitee.com/li-touch/infinity-instruct) have been publicly
released.
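
The two-phase pipeline described above can be pictured with a minimal sketch. The Python below is only an illustration under stated assumptions: the Sample fields, the quality threshold, and the evolve/diagnose callables are hypothetical stand-ins and do not reflect the paper's actual implementation of hybrid data selection, instruction evolution, or diagnostic filtering.

from dataclasses import dataclass
from typing import Callable, Iterable, List

@dataclass
class Sample:
    instruction: str
    response: str
    quality: float  # score from a hypothetical quality model (assumption)

def phase1_select(pool: Iterable[Sample], threshold: float = 0.9) -> List[Sample]:
    """Phase 1 (sketch): keep foundational samples whose quality score clears a
    threshold, standing in for the paper's hybrid selection over ~100M samples."""
    return [s for s in pool if s.quality >= threshold]

def phase2_synthesize(
    seeds: List[Sample],
    evolve: Callable[[str], str],       # rewrites an instruction into a harder variant
    diagnose: Callable[[Sample], bool],  # accepts or rejects an evolved sample
) -> List[Sample]:
    """Phase 2 (sketch): evolve selected chat instructions, then apply a
    diagnostic filter to drop low-quality evolutions."""
    evolved = [Sample(evolve(s.instruction), s.response, s.quality) for s in seeds]
    return [s for s in evolved if diagnose(s)]

In this sketch, phase1_select corresponds to producing InfInstruct-F-7.4M from the raw pool, and phase2_synthesize to producing InfInstruct-G-1.5M; the concrete selection models, evolution prompts, and diagnostic criteria are described in the paper, not here.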