On the Non-decoupling of Supervised Fine-tuning and Reinforcement Learning in Post-training
January 12, 2026
Authors: Xueyan Niu, Bo Bai, Wei Han, Weixi Zhang
cs.AI
Abstract
Post-training of large language models routinely interleaves supervised fine-tuning (SFT) with reinforcement learning (RL). The two methods have different objectives: SFT minimizes the cross-entropy loss between model outputs and expert responses, while RL maximizes reward signals derived from human preferences or rule-based verifiers. Modern reasoning models have widely adopted the practice of alternating SFT and RL training. However, there is no theoretical account of whether the two stages can be decoupled. We prove that decoupling is impossible in either order: (1) SFT-then-RL coupling: RL increases the SFT loss under SFT optimality, and (2) RL-then-SFT coupling: SFT lowers the reward achieved by RL. Experiments on Qwen3-0.6B confirm the predicted degradation, verifying that SFT and RL cannot be separated without loss of prior performance during post-training.
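For concreteness, the two objectives contrasted in the abstract can be written in a standard form (a minimal sketch using conventional notation, not necessarily the paper's; here $\pi_\theta$ is the model policy, $(x, y^\star)$ an expert prompt-response pair from a dataset $\mathcal{D}$, and $r(x, y)$ the reward from a preference model or rule-based verifier):

% Sketch of the standard objectives; notation assumed, not taken from the paper.
\begin{align*}
  % SFT: minimize the cross-entropy (negative log-likelihood) of expert responses.
  \mathcal{L}_{\mathrm{SFT}}(\theta) &= -\,\mathbb{E}_{(x,\,y^\star)\sim\mathcal{D}}\big[\log \pi_\theta(y^\star \mid x)\big], \\
  % RL: maximize the expected reward of responses sampled from the current policy.
  J_{\mathrm{RL}}(\theta) &= \mathbb{E}_{x\sim\mathcal{D},\; y\sim\pi_\theta(\cdot\mid x)}\big[r(x, y)\big].
\end{align*}

In these terms, the coupling results state, roughly, that an RL update improving $J_{\mathrm{RL}}$ from a minimizer of $\mathcal{L}_{\mathrm{SFT}}$ must increase $\mathcal{L}_{\mathrm{SFT}}$, and conversely that an SFT update applied after RL lowers the achieved reward.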