On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification
August 7, 2025
Authors: Yongliang Wu, Yizhou Zhou, Zhou Ziheng, Yingzhe Peng, Xinyu Ye, Xinting Hu, Wenbo Zhu, Lu Qi, Ming-Hsuan Yang, Xu Yang
cs.AI
Abstract
We present a simple yet theoretically motivated improvement to Supervised Fine-Tuning (SFT) for Large Language Models (LLMs), addressing its limited generalization compared to reinforcement learning (RL). Through mathematical analysis, we reveal that standard SFT gradients implicitly encode a problematic reward structure that may severely restrict the model's generalization capabilities. To rectify this, we propose Dynamic Fine-Tuning (DFT), which stabilizes gradient updates for each token by dynamically rescaling the objective function with the probability of that token. Remarkably, this single-line code change significantly outperforms standard SFT across multiple challenging benchmarks and base models, demonstrating greatly improved generalization. Additionally, our approach shows competitive results in offline RL settings, offering an effective yet simpler alternative. This work bridges theoretical insight and practical solutions, substantially advancing SFT performance. The code will be available at https://github.com/yongliang-wu/DFT.
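
To make the abstract concrete, here is one way to read the "implicit reward" claim and the "single-line" rescaling. Both the derivation and the PyTorch sketch below are illustrative reconstructions based only on the abstract, with hypothetical function and variable names; they are not taken from the authors' released implementation. Rewriting the SFT gradient as an expectation over the model's own distribution via importance sampling,

```latex
\nabla_\theta \mathcal{L}_{\mathrm{SFT}}
  = -\,\mathbb{E}_{(x,\,y^{*})}\!\left[ \nabla_\theta \log \pi_\theta(y^{*} \mid x) \right]
  = -\,\mathbb{E}_{(x,\,y^{*})}\, \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
      \!\left[ \frac{\mathbf{1}[y = y^{*}]}{\pi_\theta(y \mid x)}\,
               \nabla_\theta \log \pi_\theta(y \mid x) \right]
```

SFT then looks like a policy gradient whose implicit reward, 1[y = y*] / π_θ(y | x), blows up on low-probability target tokens. Rescaling each token's loss by a stop-gradient copy of that token's probability, as the abstract describes, cancels the 1/π_θ factor and leaves a bounded reward. A minimal sketch of such a rescaled loss:

```python
# Illustrative sketch of the per-token rescaling described in the abstract
# (hypothetical names; not the authors' code).
import torch
import torch.nn.functional as F


def dft_style_loss(logits, labels, ignore_index=-100):
    """Token-level cross entropy, with each token's term rescaled by the
    model's own (detached) probability of that target token."""
    vocab = logits.size(-1)
    logits = logits[:, :-1, :].reshape(-1, vocab)  # predict token t from prefix < t
    labels = labels[:, 1:].reshape(-1)

    # Standard SFT ingredient: -log p(y_t | y_<t, x) per token.
    per_token_ce = F.cross_entropy(
        logits, labels, ignore_index=ignore_index, reduction="none"
    )

    # The rescaling: weight each token's loss by its probability p(y_t),
    # detached so it acts as a coefficient rather than a gradient path.
    token_prob = torch.exp(-per_token_ce).detach()

    mask = (labels != ignore_index).float()
    return (token_prob * per_token_ce * mask).sum() / mask.sum().clamp(min=1)
```

Relative to standard SFT, the only change is the `token_prob` factor; dropping that line recovers the usual mean cross-entropy objective.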