

GFT: From Imitation to Reward Fine-Tuning with Unbiased Group Advantages and Dynamic Coefficient Rectification

April 15, 2026
Authors: Wangjie Gan, Miao Pan, Linbo Xi, Wenqi Zhang, Jintao Chen, Jianwei Yin, Xuhong Zhang
cs.AI

Abstract

Large language models are typically post-trained using supervised fine-tuning (SFT) and reinforcement learning (RL), yet effectively unifying efficient knowledge injection with robust generalization remains challenging. In this work, we provide a training-dynamics analysis showing that SFT can be interpreted as a special case of policy gradient optimization with an extremely sparse implicit reward and unstable inverse-probability weighting, which together lead to single-path dependency, entropy collapse, and gradient explosion. Motivated by this diagnosis, we propose Group Fine-Tuning (GFT), a unified post-training framework that addresses these intrinsic limitations through two mechanisms: Group Advantage Learning, which constructs diverse response groups and derives normalized contrastive supervision to alleviate reward sparsity, and Dynamic Coefficient Rectification, which adaptively bounds inverse-probability weights to stabilize optimization while preserving efficient knowledge injection. Experiments demonstrate that GFT consistently surpasses SFT-based methods and yields policies that integrate more smoothly with subsequent RL training.
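The abstract's claim that SFT is a special case of policy-gradient optimization can be made concrete with a short derivation. The notation below is a reconstruction from standard definitions, not the paper's own; y* denotes the demonstration for prompt x.

```latex
% Reconstruction of the "SFT as sparse-reward policy gradient" identity.
\[
\nabla_\theta \mathcal{L}_{\mathrm{SFT}}(\theta)
  = -\nabla_\theta \log \pi_\theta(y^\star \mid x)
  = -\frac{1}{\pi_\theta(y^\star \mid x)}\,\nabla_\theta \pi_\theta(y^\star \mid x).
\]
% The policy gradient with the sparse reward r(y) = 1{y = y*} is
\[
\mathbb{E}_{y \sim \pi_\theta}\bigl[\, r(y)\,\nabla_\theta \log \pi_\theta(y \mid x) \,\bigr]
  = \pi_\theta(y^\star \mid x)\,\nabla_\theta \log \pi_\theta(y^\star \mid x)
  = \nabla_\theta \pi_\theta(y^\star \mid x).
\]
% So the SFT update direction is this policy gradient rescaled by
% 1 / pi_theta(y* | x): the implicit reward is nonzero only on the single
% demonstrated path, and the weight blows up whenever the demonstration has
% low probability, matching the diagnosed single-path dependency and
% gradient explosion.
```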
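The abstract names GFT's two mechanisms but gives no formulas, so the sketch below is one plausible PyTorch reading rather than the paper's implementation: the GRPO-style mean/std group normalization, the fixed clip bound standing in for the adaptive one, the stop-gradient on the weight, and all function names are assumptions.

```python
# Minimal sketch of the two mechanisms named in the abstract (assumed forms).
import torch

def group_advantages(rewards: torch.Tensor) -> torch.Tensor:
    # Group Advantage Learning (assumed form): normalize rewards within a
    # group of G sampled responses to get dense, contrastive supervision.
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def rectified_coefficient(logprobs: torch.Tensor, bound: float = 10.0) -> torch.Tensor:
    # Dynamic Coefficient Rectification (assumed form): SFT's implicit weight
    # is 1 / pi_theta(y|x), which explodes on low-probability sequences;
    # bound it so optimization stays stable.
    return torch.exp(-logprobs).clamp(max=bound)

def gft_loss(logprobs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    # Policy-gradient surrogate over one response group: each response's
    # log-likelihood is scaled by its normalized advantage and a bounded
    # inverse-probability weight (detached so gradients flow only through
    # the log-probability term).
    adv = group_advantages(rewards)
    coeff = rectified_coefficient(logprobs).detach()
    return -(coeff * adv * logprobs).mean()

# Usage: sequence log-probs for a group of G = 4 sampled responses and
# binary correctness rewards from, e.g., a verifier.
logprobs = torch.tensor([-12.3, -8.1, -15.7, -9.4], requires_grad=True)
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])
gft_loss(logprobs, rewards).backward()
```

Under this reading, the group normalization supplies the dense contrastive signal that a single SFT demonstration lacks, while the clamp keeps the inverse-probability weight from dominating the update.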