InfiAlign: A Scalable and Sample-Efficient Framework for Aligning LLMs to Enhance Reasoning Capabilities

August 7, 2025
Authors: Shuo Cai, Su Lu, Qi Zhou, Kejing Yang, Zhijie Sang, Congkai Xie, Hongxia Yang
cs.AI

Abstract

Large language models (LLMs) have exhibited impressive reasoning abilities on a wide range of complex tasks. However, enhancing these capabilities through post-training remains resource intensive, particularly in terms of data and computational cost. Although recent efforts have sought to improve sample efficiency through selective data curation, existing methods often rely on heuristic or task-specific strategies that hinder scalability. In this work, we introduce InfiAlign, a scalable and sample-efficient post-training framework that integrates supervised fine-tuning (SFT) with Direct Preference Optimization (DPO) to align LLMs for enhanced reasoning. At the core of InfiAlign is a robust data selection pipeline that automatically curates high-quality alignment data from open-source reasoning datasets using multidimensional quality metrics. This pipeline enables significant performance gains while drastically reducing data requirements and remains extensible to new data sources. When applied to the Qwen2.5-Math-7B-Base model, our SFT model achieves performance on par with DeepSeek-R1-Distill-Qwen-7B, while using only approximately 12% of the training data, and demonstrates strong generalization across diverse reasoning tasks. Additional improvements are obtained through the application of DPO, with particularly notable gains in mathematical reasoning tasks. The model achieves an average improvement of 3.89% on AIME 24/25 benchmarks. Our results highlight the effectiveness of combining principled data selection with full-stage post-training, offering a practical solution for aligning large reasoning models in a scalable and data-efficient manner. The model checkpoints are available at https://huggingface.co/InfiX-ai/InfiAlign-Qwen-7B-SFT.
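
The abstract describes InfiAlign's core as a data selection pipeline that scores open-source reasoning data along multiple quality dimensions and keeps only a small, high-value subset (roughly 12% of the full training pool). The sketch below illustrates that idea in outline only: the metric names (correctness, trace_length, diversity), the linear weighting, and the top-fraction cutoff are illustrative assumptions, not the paper's actual metrics or selection rule.

```python
# Minimal sketch of multidimensional quality scoring for alignment-data selection.
# Metric names, weights, and the selection rule are illustrative assumptions.
from dataclasses import dataclass, field


@dataclass
class Sample:
    prompt: str
    response: str
    scores: dict = field(default_factory=dict)  # per-metric quality scores in [0, 1]


def quality_score(sample: Sample, weights: dict) -> float:
    """Combine per-metric scores into a single weighted score."""
    return sum(weights[m] * sample.scores.get(m, 0.0) for m in weights)


def select_top_fraction(samples: list, weights: dict, fraction: float = 0.12) -> list:
    """Keep the top `fraction` of samples by combined score
    (mirrors using only ~12% of the full pool, as reported in the abstract)."""
    ranked = sorted(samples, key=lambda s: quality_score(s, weights), reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return ranked[:k]


if __name__ == "__main__":
    # Hypothetical metrics: answer correctness, reasoning-trace length, source diversity.
    weights = {"correctness": 0.5, "trace_length": 0.3, "diversity": 0.2}
    pool = [
        Sample("Prove ...", "Step 1 ...", {"correctness": 0.9, "trace_length": 0.7, "diversity": 0.4}),
        Sample("Compute ...", "We have ...", {"correctness": 0.6, "trace_length": 0.2, "diversity": 0.9}),
    ]
    print(select_top_fraction(pool, weights, fraction=0.5))
```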
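
The framework then applies Direct Preference Optimization on top of the SFT model. For readers unfamiliar with DPO, the following is a minimal sketch of the standard DPO objective computed from sequence-level log-probabilities; the beta value and the toy chosen/rejected pairing are assumptions for illustration, not the paper's training recipe.

```python
# Standard DPO objective on pre-computed log-probabilities, shown only to make the
# preference-optimization step concrete; beta and the toy batch are assumptions.
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """-log sigmoid(beta * [(log pi - log pi_ref)(y_w) - (log pi - log pi_ref)(y_l)])."""
    chosen_rewards = policy_chosen_logps - ref_chosen_logps
    rejected_rewards = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_rewards - rejected_rewards)
    return -F.logsigmoid(logits).mean()


# Toy usage with sequence-level log-probabilities for a batch of preference pairs.
b = torch.randn(4)
loss = dpo_loss(b - 0.1, b - 0.9, b - 0.2, b - 0.8)
print(loss.item())
```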