InfiAlign: A Scalable and Sample-Efficient Framework for Aligning LLMs to Enhance Reasoning Capabilities

August 7, 2025
Authors: Shuo Cai, Su Lu, Qi Zhou, Kejing Yang, Zhijie Sang, Congkai Xie, Hongxia Yang
cs.AI

Abstract

Large language models (LLMs) have exhibited impressive reasoning abilities on a wide range of complex tasks. However, enhancing these capabilities through post-training remains resource intensive, particularly in terms of data and computational cost. Although recent efforts have sought to improve sample efficiency through selective data curation, existing methods often rely on heuristic or task-specific strategies that hinder scalability. In this work, we introduce InfiAlign, a scalable and sample-efficient post-training framework that integrates supervised fine-tuning (SFT) with Direct Preference Optimization (DPO) to align LLMs for enhanced reasoning. At the core of InfiAlign is a robust data selection pipeline that automatically curates high-quality alignment data from open-source reasoning datasets using multidimensional quality metrics. This pipeline enables significant performance gains while drastically reducing data requirements and remains extensible to new data sources. When applied to the Qwen2.5-Math-7B-Base model, our SFT model achieves performance on par with DeepSeek-R1-Distill-Qwen-7B, while using only approximately 12% of the training data, and demonstrates strong generalization across diverse reasoning tasks. Additional improvements are obtained through the application of DPO, with particularly notable gains in mathematical reasoning tasks. The model achieves an average improvement of 3.89% on AIME 24/25 benchmarks. Our results highlight the effectiveness of combining principled data selection with full-stage post-training, offering a practical solution for aligning large reasoning models in a scalable and data-efficient manner. The model checkpoints are available at https://huggingface.co/InfiX-ai/InfiAlign-Qwen-7B-SFT.
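
The abstract does not spell out the pipeline's concrete metrics, so the following is a minimal illustrative sketch of multidimensional quality scoring for data selection, assuming simple heuristic stand-ins. The metric names (`difficulty`, `diversity`, `structure`), the weights, and the 12% budget wiring are hypothetical placeholders, not the paper's actual criteria:

```python
from dataclasses import dataclass

@dataclass
class Sample:
    prompt: str
    response: str

def quality_scores(sample: Sample) -> dict:
    """Toy multidimensional quality metrics (hypothetical stand-ins).

    A production pipeline would use learned scorers, verifiers, or LLM
    judges; each dimension here is a cheap heuristic for illustration only.
    """
    tokens = sample.response.split()
    return {
        # Difficulty proxy: longer reasoning chains tend to be harder.
        "difficulty": min(len(tokens) / 512.0, 1.0),
        # Diversity proxy: fraction of unique tokens in the response.
        "diversity": len(set(tokens)) / max(len(tokens), 1),
        # Structure proxy: does the response expose step-by-step reasoning?
        "structure": 1.0 if "\n" in sample.response else 0.0,
    }

def select(samples, weights, budget):
    """Rank samples by a weighted sum of metric scores; keep the top `budget`."""
    ranked = sorted(
        samples,
        key=lambda s: sum(weights[k] * v for k, v in quality_scores(s).items()),
        reverse=True,
    )
    return ranked[:budget]

# Usage: keep the highest-scoring ~12% of a candidate pool.
pool = [Sample("Prove that ...", "Step 1: ...\nStep 2: ...")] * 1000
kept = select(pool,
              weights={"difficulty": 0.4, "diversity": 0.3, "structure": 0.3},
              budget=int(0.12 * len(pool)))
```

A real pipeline would replace each heuristic with stronger signals (verifier pass rates, LLM-judge ratings, embedding-based diversity), but the ranking-under-a-budget structure stays the same.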
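
For the DPO stage, the abstract reports only the resulting gains. As background, here is the standard Direct Preference Optimization objective (Rafailov et al., 2023) in PyTorch, which preference-based alignment of this kind optimizes in some form; the function and tensor names below are illustrative and not taken from the paper's implementation:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss: push the policy to prefer chosen over rejected
    responses relative to a frozen reference model.

    Each input is the summed log-probability of a full response under the
    policy or the reference model, shape (batch,). `beta` controls how
    strongly the policy is kept close to the reference.
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # -log sigmoid(beta * margin), averaged over the batch.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```

In practice the SFT checkpoint serves as the policy and a frozen copy as the reference; off-the-shelf trainers such as Hugging Face TRL's DPOTrainer implement this same loss.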