InfiAlign: A Scalable and Sample-Efficient Framework for Aligning LLMs to Enhance Reasoning Capabilities
August 7, 2025
Authors: Shuo Cai, Su Lu, Qi Zhou, Kejing Yang, Zhijie Sang, Congkai Xie, Hongxia Yang
cs.AI
Abstract
Large language models (LLMs) have exhibited impressive reasoning abilities on
a wide range of complex tasks. However, enhancing these capabilities through
post-training remains resource-intensive, particularly in terms of data and
computational cost. Although recent efforts have sought to improve sample
efficiency through selective data curation, existing methods often rely on
heuristic or task-specific strategies that hinder scalability. In this work, we
introduce InfiAlign, a scalable and sample-efficient post-training framework
that integrates supervised fine-tuning (SFT) with Direct Preference
Optimization (DPO) to align LLMs for enhanced reasoning. At the core of
InfiAlign is a robust data selection pipeline that automatically curates
high-quality alignment data from open-source reasoning datasets using
multidimensional quality metrics. This pipeline enables significant performance
gains while drastically reducing data requirements and remains extensible to
new data sources. When applied to the Qwen2.5-Math-7B-Base model, our SFT model
achieves performance on par with DeepSeek-R1-Distill-Qwen-7B, while using only
approximately 12% of the training data, and demonstrates strong generalization
across diverse reasoning tasks. Applying DPO yields further improvements, with
particularly notable gains in mathematical reasoning: the model achieves an
average improvement of 3.89% on the AIME 24/25 benchmarks. Our results
highlight the effectiveness of combining
principled data selection with full-stage post-training, offering a practical
solution for aligning large reasoning models in a scalable and data-efficient
manner. The model checkpoints are available at
https://huggingface.co/InfiX-ai/InfiAlign-Qwen-7B-SFT.
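
The abstract does not spell out how the multidimensional quality metrics are combined when curating alignment data. As a purely illustrative sketch (the metric names, weights, and the `select_alignment_data` helper below are assumptions for exposition, not the paper's actual pipeline), a weighted-scoring selection step might look like this in Python:

```python
from dataclasses import dataclass
from typing import Iterable, List

@dataclass
class Candidate:
    prompt: str
    response: str
    # Hypothetical per-sample metrics; the paper's actual metric
    # definitions and scoring may differ.
    difficulty: float        # estimated problem difficulty, in [0, 1]
    diversity: float         # novelty relative to already-selected data
    response_quality: float  # e.g. a verifier or reward-model score

def select_alignment_data(candidates: Iterable[Candidate],
                          weights=(0.4, 0.3, 0.3),
                          budget: int = 10_000) -> List[Candidate]:
    """Rank candidates by a weighted sum of quality metrics and keep
    the top `budget` samples for the SFT/DPO stages."""
    w_diff, w_div, w_qual = weights
    scored = sorted(
        candidates,
        key=lambda c: (w_diff * c.difficulty
                       + w_div * c.diversity
                       + w_qual * c.response_quality),
        reverse=True,
    )
    return scored[:budget]
```

In such a scheme, extending the pipeline to a new open-source dataset only requires computing the same metrics for its samples before re-ranking, which is consistent with the extensibility the abstract claims.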
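
For the DPO stage, the standard Direct Preference Optimization objective (Rafailov et al., 2023) can be written as a short PyTorch function. This is a generic sketch of the published loss rather than InfiAlign's training code, and the tensor names are illustrative:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss. Inputs are per-sequence summed log-probabilities
    of shape (batch,); beta scales the implicit KL penalty toward the
    frozen reference model."""
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    # Push the policy to prefer the chosen response over the rejected one.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy check with random log-probabilities.
batch = 4
loss = dpo_loss(torch.randn(batch), torch.randn(batch),
                torch.randn(batch), torch.randn(batch))
```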