InfiAlign: A Scalable and Sample-Efficient Framework for Aligning LLMs to Enhance Reasoning Capabilities

Abstract

Large language models (LLMs) have exhibited impressive reasoning abilities on a wide range of complex tasks. However, enhancing these capabilities through post-training remains resource-intensive, particularly in terms of data and computational cost. Although recent efforts have sought to improve sample efficiency through selective data curation, existing methods often rely on heuristic or task-specific strategies that hinder scalability. In this work, we introduce InfiAlign, a scalable and sample-efficient post-training framework that integrates supervised fine-tuning (SFT) with Direct Preference Optimization (DPO) to align LLMs for enhanced reasoning. At the core of InfiAlign is a robust data selection pipeline that automatically curates high-quality alignment data from open-source reasoning datasets using multidimensional quality metrics. This pipeline enables significant performance gains while drastically reducing data requirements, and it remains extensible to new data sources. When applied to the Qwen2.5-Math-7B-Base model, our SFT model achieves performance on par with DeepSeek-R1-Distill-Qwen-7B while using only approximately 12% of the training data, and it demonstrates strong generalization across diverse reasoning tasks. Applying DPO yields additional improvements, with particularly notable gains on mathematical reasoning tasks: the model achieves an average improvement of 3.89% on the AIME 24/25 benchmarks. Our results highlight the effectiveness of combining principled data selection with full-stage post-training, offering a practical solution for aligning large reasoning models in a scalable and data-efficient manner. The model checkpoints are available at https://huggingface.co/InfiX-ai/InfiAlign-Qwen-7B-SFT.
