

Good SFT Optimizes for SFT, Better SFT Prepares for Reinforcement Learning

February 1, 2026
Authors: Dylan Zhang, Yufeng Xu, Haojin Wang, Qingzhi Chen, Hao Peng
cs.AI

Abstract

Post-training of reasoning LLMs is a holistic process that typically consists of an offline SFT stage followed by an online reinforcement learning (RL) stage. However, SFT is often optimized in isolation to maximize SFT performance alone. We show that, after identical RL training, models initialized from stronger SFT checkpoints can significantly underperform those initialized from weaker ones. We attribute this to a mismatch typical of current SFT-RL pipelines: the distribution that generates the offline SFT data can differ substantially from the policy optimized during online RL, which learns from its own rollouts. We propose PEAR (Policy Evaluation-inspired Algorithm for Offline Learning Loss Re-weighting), an SFT-stage method that corrects this mismatch and better prepares the model for RL. PEAR uses importance sampling to reweight the SFT loss, with three variants operating at the token, block, and sequence levels. It can be used to augment standard SFT objectives and incurs little additional training overhead once probabilities for the offline data are collected. We conduct controlled experiments on verifiable reasoning games and mathematical reasoning tasks with Qwen 2.5, Qwen 3, and DeepSeek-distilled models. PEAR consistently improves post-RL performance over canonical SFT, with pass@8 gains of up to 14.6% on AIME 2025. Our results suggest that PEAR is an effective step toward more holistic LLM post-training by designing and evaluating SFT with downstream RL in mind rather than in isolation.
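The abstract describes PEAR only at a high level, so the sketch below is an illustration rather than the paper's implementation: a token-level SFT loss re-weighted by a clipped importance ratio between the current policy and the distribution that generated the offline data, written in PyTorch. The function name `pear_token_level_sft_loss`, the direction of the ratio, and the clipping threshold are assumptions introduced for illustration.

```python
import torch
import torch.nn.functional as F


def pear_token_level_sft_loss(policy_logits, target_ids, behavior_logprobs, clip_max=2.0):
    """Token-level importance-sampling re-weighted SFT loss (illustrative sketch,
    not the paper's exact objective).

    policy_logits:     [B, T, V] logits of the model being fine-tuned.
    target_ids:        [B, T]    token ids of the offline SFT responses.
    behavior_logprobs: [B, T]    log-probabilities of those tokens under the
                                 distribution that generated the offline data
                                 (collected once, before training).
    clip_max:          cap on the importance weights to control variance.
    """
    # Standard per-token negative log-likelihood under the current policy.
    logprobs = F.log_softmax(policy_logits, dim=-1)
    token_logprobs = logprobs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)  # [B, T]
    nll = -token_logprobs

    # Importance ratio pi_theta(y_t | context) / mu(y_t | context), detached so it
    # scales the loss without contributing gradients of its own; clipping limits
    # the variance introduced by large ratios. The ratio direction is an assumption.
    with torch.no_grad():
        weights = torch.exp(token_logprobs - behavior_logprobs).clamp(max=clip_max)

    # Weighted SFT loss; averaging over all tokens keeps the scale comparable
    # to the canonical SFT objective.
    return (weights * nll).mean()
```

Block- and sequence-level variants would presumably aggregate the log-probability gap over spans or whole responses before forming the ratio; consult the paper for the actual definitions.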