Infinity Instruct: Scaling Instruction Selection and Synthesis to Enhance Language Models

Abstract

Large Language Models (LLMs) demonstrate strong performance in real-world applications, yet existing open-source instruction datasets often concentrate on narrow domains, such as mathematics or coding, which limits generalization and widens the gap with proprietary models. To bridge this gap, we introduce Infinity-Instruct, a high-quality instruction dataset designed to enhance both the foundational and chat capabilities of LLMs through a two-phase pipeline. In Phase 1, we curate 7.4M high-quality foundational instructions (InfInstruct-F-7.4M) from over 100M samples using hybrid data selection techniques. In Phase 2, we synthesize 1.5M high-quality chat instructions (InfInstruct-G-1.5M) through a two-stage process involving instruction selection, evolution, and diagnostic filtering. We empirically evaluate Infinity-Instruct by fine-tuning several open-source models, including Mistral, LLaMA, Qwen, and Yi, and observe substantial performance gains on both foundational and instruction-following benchmarks, consistently surpassing the official instruction-tuned counterparts. Notably, InfInstruct-LLaMA3.1-70B outperforms GPT-4-0314 by 8.6% on instruction-following tasks while achieving comparable foundational performance. These results underscore the synergy between foundational and chat training and offer new insights into holistic LLM development. Our dataset (https://huggingface.co/datasets/BAAI/Infinity-Instruct) and code (https://gitee.com/li-touch/infinity-instruct) have been publicly released.
