Good SFT Optimizes for SFT, Better SFT Prepares for Reinforcement Learning

Abstract

Post-training of reasoning LLMs is a holistic process that typically consists of an offline SFT stage followed by an online reinforcement learning (RL) stage. However, SFT is often optimized in isolation to maximize SFT performance alone. We show that, after identical RL training, models initialized from stronger SFT checkpoints can significantly underperform those initialized from weaker ones. We attribute this to a mismatch typical of current SFT-RL pipelines: the distribution that generates the offline SFT data can differ substantially from the policy optimized during online RL, which learns from its own rollouts. We propose PEAR (Policy Evaluation-inspired Algorithm for Offline Learning Loss Re-weighting), an SFT-stage method that corrects this mismatch and better prepares the model for RL. PEAR uses importance sampling to reweight the SFT loss, with three variants operating at the token, block, and sequence levels. It can be used to augment standard SFT objectives and incurs little additional training overhead once probabilities for the offline data are collected. We conduct controlled experiments on verifiable reasoning games and mathematical reasoning tasks using Qwen 2.5, Qwen 3, and DeepSeek-distilled models. PEAR consistently improves post-RL performance over canonical SFT, with pass@8 gains of up to 14.6% on AIME2025. Our results suggest that PEAR is an effective step toward more holistic LLM post-training, in which SFT is designed and evaluated with downstream RL in mind rather than in isolation.
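The abstract describes PEAR only at a high level: importance sampling reweights the SFT loss at token, block, or sequence granularity. The sketch below illustrates that general idea, not the paper's actual objective, which the abstract does not give. It rests on several assumptions: the weight is the plain likelihood ratio between the current policy and the data-generating distribution, block and sequence weights are products of token ratios, weights are clipped for stability, and the loss is a weighted mean negative log-likelihood. The function names, `block_size`, and `clip` are all hypothetical.

```python
import math
from typing import List


def importance_weights(
    logp_policy: List[float],
    logp_behavior: List[float],
    level: str = "token",
    block_size: int = 4,   # hypothetical block length
    clip: float = 5.0,     # hypothetical upper bound on any weight
) -> List[float]:
    """Return one weight per token of an offline sequence.

    logp_policy[t]:   log-probability of token t under the model being trained.
    logp_behavior[t]: log-probability of token t under the distribution that
                      generated the offline data (collected once, up front).
    """
    if not logp_policy or len(logp_policy) != len(logp_behavior):
        raise ValueError("inputs must be non-empty and of equal length")

    # Per-token log likelihood ratio: log(pi / mu).
    log_ratio = [p - b for p, b in zip(logp_policy, logp_behavior)]
    n = len(log_ratio)

    if level == "token":
        grouped = log_ratio
    elif level == "block":
        # Every token in a block shares the product of that block's ratios.
        grouped = []
        for start in range(0, n, block_size):
            chunk = log_ratio[start:start + block_size]
            grouped.extend([sum(chunk)] * len(chunk))
    elif level == "sequence":
        # Every token shares the ratio of the whole sequence.
        grouped = [sum(log_ratio)] * n
    else:
        raise ValueError(f"unknown level: {level!r}")

    return [min(math.exp(g), clip) for g in grouped]


def reweighted_sft_loss(
    logp_policy: List[float],
    logp_behavior: List[float],
    **kwargs,
) -> float:
    """Weighted mean negative log-likelihood over one sequence."""
    weights = importance_weights(logp_policy, logp_behavior, **kwargs)
    total = sum(w * -lp for w, lp in zip(weights, logp_policy))
    return total / len(logp_policy)


if __name__ == "__main__":
    policy = [-0.5, -1.5, -0.2, -2.0]
    behavior = [-1.5, -0.5, -0.2, -1.0]
    for level in ("token", "block", "sequence"):
        loss = reweighted_sft_loss(policy, behavior, level=level, block_size=2)
        print(level, round(loss, 4))
```

In a real training loop the weights would typically be treated as constants, with no gradient flowing through them, so that only the log-likelihood term is differentiated. Whether PEAR does this, and how it normalizes or clips its weights, is not stated in the abstract.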
