

Good SFT Optimizes for SFT, Better SFT Prepares for Reinforcement Learning

February 1, 2026
Authors: Dylan Zhang, Yufeng Xu, Haojin Wang, Qingzhi Chen, Hao Peng
cs.AI

Abstract

Post-training of reasoning LLMs is a holistic process that typically consists of an offline SFT stage followed by an online reinforcement learning (RL) stage. However, SFT is often optimized in isolation to maximize SFT performance alone. We show that, after identical RL training, models initialized from stronger SFT checkpoints can significantly underperform those initialized from weaker ones. We attribute this to a mismatch typical in current SFT-RL pipelines: the distribution that generates the offline SFT data can differ substantially from the policy optimized during online RL, which learns from its own rollouts. We propose PEAR (Policy Evaluation-inspired Algorithm for Offline Learning Loss Re-weighting), an SFT-stage method that corrects this mismatch and better prepares the model for RL. PEAR uses importance sampling to reweight the SFT loss, with three variants operating at the token, block, and sequence levels. It can be used to augment standard SFT objectives and incurs little additional training overhead once probabilities for the offline data are collected. We conduct controlled experiments on verifiable reasoning games and mathematical reasoning tasks on Qwen 2.5 and 3 and DeepSeek-distilled models. PEAR consistently improves post-RL performance over canonical SFT, with pass@8 gains of up to 14.6 percent on AIME2025. Our results suggest that PEAR is an effective step toward more holistic LLM post-training by designing and evaluating SFT with downstream RL in mind rather than in isolation.
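The abstract describes reweighting the SFT loss with importance ratios between the current policy and the offline data-generating distribution, with token-, block-, and sequence-level variants. A minimal Python sketch of the token-level idea follows; the function names, the weight-clipping constant, and treating the weights as fixed constants during the loss computation are illustrative assumptions, not details from the paper:

```python
import math

def token_importance_weights(logp_policy, logp_data, clip=2.0):
    """Per-token importance ratios w_t = exp(logp_policy_t - logp_data_t).

    logp_policy: token log-probs of the SFT target under the current policy.
    logp_data:   token log-probs under the distribution that generated the data.
    Weights are clipped from above to bound variance (clip value is an assumption).
    """
    return [min(math.exp(lp - ld), clip) for lp, ld in zip(logp_policy, logp_data)]

def reweighted_sft_loss(logp_policy, logp_data, clip=2.0):
    """Standard SFT negative log-likelihood, reweighted token by token.

    The weights are treated as constants (no gradient flows through them),
    so this only rescales the usual per-token NLL terms.
    """
    weights = token_importance_weights(logp_policy, logp_data, clip)
    per_token_nll = [-w * lp for w, lp in zip(weights, logp_policy)]
    return sum(per_token_nll) / len(per_token_nll)

# Example: the second token is much likelier under the policy than under the
# data distribution, so its loss term is up-weighted (and clipped at 2.0).
loss = reweighted_sft_loss([-1.0, -2.0], [-1.0, -3.0])  # -> 2.5
```

A sequence-level variant would instead use a single weight per example, e.g. the clipped product of the per-token ratios; the principle (upweighting data the current policy is more likely to produce itself) is the same.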
PDF | March 12, 2026