Infinity Instruct: Scaling Instruction Selection and Synthesis to Enhance Language Models

Abstract

Large Language Models (LLMs) demonstrate strong performance in real-world applications, yet existing open-source instruction datasets often concentrate on narrow domains, such as mathematics or coding, which limits generalization and widens the gap with proprietary models. To bridge this gap, we introduce Infinity-Instruct, a high-quality instruction dataset designed to enhance both the foundational and chat capabilities of LLMs through a two-phase pipeline. In Phase 1, we curate 7.4M high-quality foundational instructions (InfInstruct-F-7.4M) from over 100M samples using hybrid data selection techniques. In Phase 2, we synthesize 1.5M high-quality chat instructions (InfInstruct-G-1.5M) through a two-stage process involving instruction selection, evolution, and diagnostic filtering. We empirically evaluate Infinity-Instruct by fine-tuning several open-source models, including Mistral, LLaMA, Qwen, and Yi, and observe substantial performance gains on both foundational and instruction-following benchmarks, consistently surpassing the official instruction-tuned counterparts. Notably, InfInstruct-LLaMA3.1-70B outperforms GPT-4-0314 by 8.6% on instruction-following tasks while achieving comparable foundational performance. These results underscore the synergy between foundational and chat training and offer new insights into holistic LLM development. Our dataset (https://huggingface.co/datasets/BAAI/Infinity-Instruct) and code (https://gitee.com/li-touch/infinity-instruct) have been publicly released.
