GFT: From Imitation to Reward Fine-Tuning with Unbiased Group Advantages and Dynamic Coefficient Rectification

Abstract

Large language models are typically post-trained with supervised fine-tuning (SFT) followed by reinforcement learning (RL), yet unifying efficient knowledge injection with robust generalization remains challenging. In this work, we provide a training-dynamics analysis showing that SFT can be interpreted as a special case of policy gradient optimization with an extremely sparse implicit reward and unstable inverse-probability weighting, which together lead to single-path dependency, entropy collapse, and gradient explosion. Motivated by this diagnosis, we propose Group Fine-Tuning (GFT), a unified post-training framework that addresses these intrinsic limitations through two mechanisms. Group Advantage Learning constructs diverse response groups and derives normalized contrastive supervision to alleviate reward sparsity. Dynamic Coefficient Rectification adaptively bounds inverse-probability weights to stabilize optimization while preserving efficient knowledge injection. Experiments demonstrate that GFT consistently surpasses SFT-based methods and yields policies that integrate more smoothly with subsequent RL training.
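The claim that SFT is a special case of policy gradient can be checked numerically on a toy problem. The sketch below is an illustration only, not the paper's implementation. It assumes a single-step softmax policy over a handful of discrete responses, an indicator reward that is 1 only for the reference response, and an inverse-probability weight 1/pi(y). Under these assumptions the expected policy gradient equals the SFT gradient of log pi(y*). All function names are hypothetical. The fixed cap in clipped_inverse_weight is a simple stand-in to show why bounding the weight matters, and it is not the adaptive rule used by Dynamic Coefficient Rectification.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()


def grad_log_prob(probs, y):
    """Gradient of log softmax(logits)[y] w.r.t. the logits: onehot(y) - probs."""
    grad = -probs.copy()
    grad[y] += 1.0
    return grad


def sft_gradient(logits, target):
    """Gradient of the SFT objective log pi(target) w.r.t. the logits."""
    return grad_log_prob(softmax(logits), target)


def policy_gradient_view(logits, target):
    """Exact expectation over y ~ pi of weight * reward * grad log pi(y).

    Assumed toy setup: reward is 1 only when y equals the reference
    response, and weight is the inverse probability 1 / pi(y).
    """
    probs = softmax(logits)
    total = np.zeros_like(probs)
    for y, p in enumerate(probs):
        reward = 1.0 if y == target else 0.0  # extremely sparse implicit reward
        weight = 1.0 / p                      # inverse-probability weight
        total += p * weight * reward * grad_log_prob(probs, y)
    return total


def clipped_inverse_weight(prob, max_weight=10.0):
    """Hypothetical fixed cap on 1 / prob (not the paper's adaptive rule)."""
    return min(1.0 / prob, max_weight)


if __name__ == "__main__":
    logits = np.array([2.0, 0.5, -1.0, -3.0])
    target = 3  # the reference response is unlikely under the current policy
    print("SFT gradient:        ", sft_gradient(logits, target))
    print("Policy-gradient view:", policy_gradient_view(logits, target))
    p_target = softmax(logits)[target]
    print("Raw weight 1/pi(y*): ", 1.0 / p_target)
    print("Capped weight:       ", clipped_inverse_weight(p_target))
```

When the reference response has low probability under the current policy, the raw weight 1/pi(y*) becomes large, which matches the instability the abstract attributes to inverse-probability weighting. Only the single reference response ever receives a nonzero reward in this view, which is the sparsity that Group Advantage Learning is said to alleviate.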
