SRFT: A Single-Stage Method with Supervised and Reinforcement Fine-Tuning for Reasoning

Abstract

Large language models (LLMs) have achieved remarkable progress in reasoning tasks, yet how best to integrate Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) remains a fundamental challenge. Through a comprehensive entropy-based analysis of token distributions, learning dynamics, and integration mechanisms, we reveal key differences between these paradigms: SFT induces coarse-grained global changes to the LLM policy distribution, while RL performs fine-grained selective optimizations, with entropy serving as a critical indicator of training effectiveness. Building on these observations, we propose Supervised Reinforcement Fine-Tuning (SRFT), a single-stage method that unifies both fine-tuning paradigms through entropy-aware weighting mechanisms. Rather than running SFT and RL as two sequential stages, our approach applies them simultaneously, optimizing the LLM directly on demonstrations and self-exploration rollouts. Extensive experiments show that SRFT achieves 59.1% average accuracy, outperforming zero-RL methods by 9.0% on five mathematical reasoning benchmarks and by 10.9% on three out-of-distribution benchmarks.
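The single-stage idea described above can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not give the form of the weighting, so the exponential weight `scale * exp(-entropy)`, the REINFORCE-style RL term with a mean-reward baseline, and all function names below are assumptions chosen only to show how one loss could combine a demonstration term and a rollout term under an entropy-dependent weight.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(dists):
    """Average token entropy over a sequence of next-token distributions."""
    return sum(token_entropy(d) for d in dists) / len(dists)


def sft_weight(entropy, scale=0.5):
    """Hypothetical entropy-aware weight (an assumption, not from the source).

    The weight shrinks as policy entropy grows, so the demonstration term
    is damped when the policy is uncertain.
    """
    return scale * math.exp(-entropy)


def single_stage_loss(demo_token_probs, demo_dists, rollout_logprobs, rollout_rewards):
    """Combine an SFT term and an RL term into one scalar loss.

    demo_token_probs: policy probability of each demonstration token
    demo_dists:       full next-token distribution at each demonstration step
    rollout_logprobs: sequence log-probability of each sampled rollout
    rollout_rewards:  scalar reward of each sampled rollout
    """
    # SFT term: mean negative log-likelihood of the demonstration tokens.
    sft = -sum(math.log(p) for p in demo_token_probs) / len(demo_token_probs)

    # RL term: REINFORCE surrogate with a mean-reward baseline (assumed form).
    baseline = sum(rollout_rewards) / len(rollout_rewards)
    rl = -sum(
        (r - baseline) * lp for r, lp in zip(rollout_rewards, rollout_logprobs)
    ) / len(rollout_rewards)

    # Entropy-aware weighting of the demonstration term.
    w = sft_weight(mean_entropy(demo_dists))
    return w * sft + rl
```

In an automatic-differentiation framework, the weight would typically be treated as a constant (no gradient flowing through the entropy), so that it rescales the SFT gradient without becoming an optimization target itself; that choice is likewise an assumption of this sketch.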
