InfiAlign: A Scalable and Sample-Efficient Framework for Aligning LLMs to Enhance Reasoning Capabilities

Abstract

Large language models (LLMs) have exhibited impressive reasoning abilities on a wide range of complex tasks. However, enhancing these capabilities through post-training remains resource-intensive, particularly in terms of data and computational cost. Although recent efforts have sought to improve sample efficiency through selective data curation, existing methods often rely on heuristics or task-specific strategies that hinder scalability. In this work, we introduce InfiAlign, a scalable and sample-efficient post-training framework that integrates supervised fine-tuning (SFT) with Direct Preference Optimization (DPO) to align LLMs for enhanced reasoning. At the core of InfiAlign is a robust data selection pipeline that automatically curates high-quality alignment data from open-source reasoning datasets using multidimensional quality metrics. This pipeline enables significant performance gains while drastically reducing data requirements, and it remains extensible to new data sources. When applied to the Qwen2.5-Math-7B-Base model, our SFT model achieves performance on par with DeepSeek-R1-Distill-Qwen-7B while using only approximately 12% of the training data, and it demonstrates strong generalization across diverse reasoning tasks. Applying DPO yields additional improvements, with particularly notable gains on mathematical reasoning tasks. The model achieves an average improvement of 3.89% on the AIME 24/25 benchmarks. Our results highlight the effectiveness of combining principled data selection with full-stage post-training, offering a practical solution for aligning large reasoning models in a scalable and data-efficient manner. The model checkpoints are available at https://huggingface.co/InfiX-ai/InfiAlign-Qwen-7B-SFT.
