On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification
August 7, 2025
Authors: Yongliang Wu, Yizhou Zhou, Zhou Ziheng, Yingzhe Peng, Xinyu Ye, Xinting Hu, Wenbo Zhu, Lu Qi, Ming-Hsuan Yang, Xu Yang
cs.AI
Abstract
We present a simple yet theoretically motivated improvement to Supervised Fine-Tuning (SFT) for Large Language Models (LLMs), addressing its limited generalization compared to reinforcement learning (RL). Through mathematical analysis, we reveal that standard SFT gradients implicitly encode a problematic reward structure that may severely restrict the model's generalization capabilities. To rectify this, we propose Dynamic Fine-Tuning (DFT), which stabilizes gradient updates for each token by dynamically rescaling the objective function with that token's probability. Remarkably, this single-line code change significantly outperforms standard SFT across multiple challenging benchmarks and base models, demonstrating greatly improved generalization. Additionally, our approach shows competitive results in offline RL settings, offering an effective yet simpler alternative. This work bridges theoretical insight with practical solutions, substantially advancing SFT performance. The code will be available at https://github.com/yongliang-wu/DFT.
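
To make the described rescaling concrete, below is a minimal sketch, in PyTorch, of what a DFT-style per-token objective could look like. The function name dft_loss, its signature, and the choice to detach the rescaling weight are illustrative assumptions on our part, not the authors' released implementation; refer to the repository above for the official code.

    import torch
    import torch.nn.functional as F

    def dft_loss(logits: torch.Tensor, labels: torch.Tensor,
                 ignore_index: int = -100) -> torch.Tensor:
        # logits: (batch, seq_len, vocab_size); labels: (batch, seq_len)
        log_probs = F.log_softmax(logits, dim=-1)
        # Log-probability of each target token, log p(y_t | x, y_<t).
        target_logp = log_probs.gather(
            -1, labels.clamp_min(0).unsqueeze(-1)).squeeze(-1)
        mask = (labels != ignore_index).float()
        # Standard SFT would use: per_token_loss = -target_logp
        # DFT-style change (assumed form): rescale each token's loss by the
        # probability the model assigns to that token, detached so the weight
        # acts as a constant during backpropagation.
        weight = target_logp.exp().detach()
        per_token_loss = -weight * target_logp
        return (per_token_loss * mask).sum() / mask.sum().clamp_min(1.0)

In this sketch, tokens the model already assigns low probability contribute a down-weighted gradient relative to standard cross-entropy, which is one way to realize the "single-line" rescaling the abstract describes.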