On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification
August 7, 2025
Authors: Yongliang Wu, Yizhou Zhou, Zhou Ziheng, Yingzhe Peng, Xinyu Ye, Xinting Hu, Wenbo Zhu, Lu Qi, Ming-Hsuan Yang, Xu Yang
cs.AI
Abstract
We present a simple yet theoretically motivated improvement to Supervised Fine-Tuning (SFT) for Large Language Models (LLMs), addressing its limited generalization compared to reinforcement learning (RL). Through mathematical analysis, we reveal that standard SFT gradients implicitly encode a problematic reward structure that may severely restrict the model's generalization capabilities. To rectify this, we propose Dynamic Fine-Tuning (DFT), which stabilizes gradient updates for each token by dynamically rescaling the objective function with that token's probability. Remarkably, this single-line code change significantly outperforms standard SFT across multiple challenging benchmarks and base models, demonstrating greatly improved generalization. Additionally, our approach shows competitive results in offline RL settings, offering an effective yet simpler alternative. This work bridges theoretical insight with practical solutions, substantially advancing SFT performance. The code will be available at https://github.com/yongliang-wu/DFT.
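
To make the described rescaling concrete, below is a minimal sketch, in PyTorch, of what a DFT-style per-token objective could look like. The function name dft_loss, its signature, and the choice to detach the rescaling weight are illustrative assumptions on our part, not the authors' released implementation; refer to the repository above for the official code.

    import torch
    import torch.nn.functional as F

    def dft_loss(logits: torch.Tensor, labels: torch.Tensor,
                 ignore_index: int = -100) -> torch.Tensor:
        # logits: (batch, seq_len, vocab_size); labels: (batch, seq_len)
        log_probs = F.log_softmax(logits, dim=-1)
        # Log-probability of each target token, log p(y_t | x, y_<t).
        target_logp = log_probs.gather(
            -1, labels.clamp_min(0).unsqueeze(-1)).squeeze(-1)
        mask = (labels != ignore_index).float()
        # Standard SFT would use: per_token_loss = -target_logp
        # DFT-style change (assumed form): rescale each token's loss by the
        # probability the model assigns to that token, detached so the weight
        # acts as a constant during backpropagation.
        weight = target_logp.exp().detach()
        per_token_loss = -weight * target_logp
        return (per_token_loss * mask).sum() / mask.sum().clamp_min(1.0)

In this sketch, tokens the model already assigns low probability contribute a down-weighted gradient relative to standard cross-entropy, which is one way to realize the "single-line" rescaling the abstract describes.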