On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification

Abstract

We present a simple yet theoretically motivated improvement to Supervised Fine-Tuning (SFT) for Large Language Models (LLMs), addressing its limited generalization compared to reinforcement learning (RL). Through mathematical analysis, we reveal that standard SFT gradients implicitly encode a problematic reward structure that may severely restrict the model's generalization capabilities. To rectify this, we propose Dynamic Fine-Tuning (DFT), which stabilizes gradient updates by dynamically rescaling the objective function for each token with the probability of that token. Remarkably, this single-line code change significantly outperforms standard SFT across multiple challenging benchmarks and base models, demonstrating greatly improved generalization. Additionally, our approach shows competitive results in offline RL settings, offering an effective yet simpler alternative. This work bridges theoretical insight and practical solutions, substantially advancing SFT performance. The code will be available at https://github.com/yongliang-wu/DFT.
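The rescaling described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' released implementation. It assumes that the per-token SFT loss is the negative log-likelihood of the target token and that the DFT weight (the target token's probability) is treated as a constant, so no gradient flows through it. The function names below are hypothetical and chosen only for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sft_token_loss(logits, target):
    """Standard SFT loss for one token: -log p(target)."""
    return -math.log(softmax(logits)[target])


def dft_token_loss(logits, target):
    """DFT-style loss for one token: the SFT loss scaled by p(target).

    Assumption: the scaling factor is a constant weight (stop-gradient),
    so it rescales the update without adding its own gradient term.
    """
    p = softmax(logits)[target]
    return -p * math.log(p)


def sft_token_grad(logits, target):
    """Gradient of -log p(target) w.r.t. the logits: p_k - 1[k == target]."""
    probs = softmax(logits)
    return [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]


def dft_token_grad(logits, target):
    """Gradient of the DFT-style loss with the weight held constant.

    Under the stop-gradient assumption this is the SFT gradient
    multiplied by p(target), so low-probability tokens yield smaller updates.
    """
    weight = softmax(logits)[target]
    return [weight * g for g in sft_token_grad(logits, target)]
```

Calling `sft_token_grad([2.0, 0.0, -1.0], 2)` and `dft_token_grad([2.0, 0.0, -1.0], 2)` shows the effect for an unlikely target token: the two gradients point in the same direction, but the DFT-style one is shrunk by that token's probability. In an autograd framework, the same idea would typically amount to multiplying the per-token loss by a detached copy of the token probability, which is consistent with the abstract's description of a single-line change.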

