On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification
August 7, 2025
Authors: Yongliang Wu, Yizhou Zhou, Zhou Ziheng, Yingzhe Peng, Xinyu Ye, Xinting Hu, Wenbo Zhu, Lu Qi, Ming-Hsuan Yang, Xu Yang
cs.AI
Abstract
We present a simple yet theoretically motivated improvement to Supervised Fine-Tuning (SFT) for Large Language Models (LLMs), addressing its limited generalization compared to reinforcement learning (RL). Through mathematical analysis, we reveal that standard SFT gradients implicitly encode a problematic reward structure that may severely restrict the model's generalization capabilities. To rectify this, we propose Dynamic Fine-Tuning (DFT), which stabilizes gradient updates for each token by dynamically rescaling the objective function with the probability of that token. Remarkably, this single-line code change significantly outperforms standard SFT across multiple challenging benchmarks and base models, demonstrating greatly improved generalization. Additionally, our approach shows competitive results in offline RL settings, offering an effective yet simpler alternative. This work bridges theoretical insight and practical solutions, substantially advancing SFT performance. The code will be available at https://github.com/yongliang-wu/DFT.
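
To make the abstract concrete, here is one way to read the "implicit reward" claim and the "single-line" rescaling. Both the derivation and the PyTorch sketch below are illustrative reconstructions based only on the abstract, with hypothetical function and variable names; they are not taken from the authors' released implementation. Rewriting the SFT gradient as an expectation over the model's own distribution via importance sampling,

```latex
\nabla_\theta \mathcal{L}_{\mathrm{SFT}}
  = -\,\mathbb{E}_{(x,\,y^{*})}\!\left[ \nabla_\theta \log \pi_\theta(y^{*} \mid x) \right]
  = -\,\mathbb{E}_{(x,\,y^{*})}\, \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
      \!\left[ \frac{\mathbf{1}[y = y^{*}]}{\pi_\theta(y \mid x)}\,
               \nabla_\theta \log \pi_\theta(y \mid x) \right]
```

SFT then looks like a policy gradient whose implicit reward, 1[y = y*] / π_θ(y | x), blows up on low-probability target tokens. Rescaling each token's loss by a stop-gradient copy of that token's probability, as the abstract describes, cancels the 1/π_θ factor and leaves a bounded reward. A minimal sketch of such a rescaled loss:

```python
# Illustrative sketch of the per-token rescaling described in the abstract
# (hypothetical names; not the authors' code).
import torch
import torch.nn.functional as F


def dft_style_loss(logits, labels, ignore_index=-100):
    """Token-level cross entropy, with each token's term rescaled by the
    model's own (detached) probability of that target token."""
    vocab = logits.size(-1)
    logits = logits[:, :-1, :].reshape(-1, vocab)  # predict token t from prefix < t
    labels = labels[:, 1:].reshape(-1)

    # Standard SFT ingredient: -log p(y_t | y_<t, x) per token.
    per_token_ce = F.cross_entropy(
        logits, labels, ignore_index=ignore_index, reduction="none"
    )

    # The rescaling: weight each token's loss by its probability p(y_t),
    # detached so it acts as a coefficient rather than a gradient path.
    token_prob = torch.exp(-per_token_ce).detach()

    mask = (labels != ignore_index).float()
    return (token_prob * per_token_ce * mask).sum() / mask.sum().clamp(min=1)
```

Relative to standard SFT, the only change is the `token_prob` factor; dropping that line recovers the usual mean cross-entropy objective.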