GFT: From Imitation to Reward Fine-Tuning with Unbiased Group Advantages and Dynamic Coefficient Rectification
April 15, 2026
Authors: Wangjie Gan, Miao Pan, Linbo Xi, Wenqi Zhang, Jintao Chen, Jianwei Yin, Xuhong Zhang
cs.AI
Abstract
Large language models are typically post-trained using supervised fine-tuning (SFT) and reinforcement learning (RL), yet effectively unifying efficient knowledge injection with robust generalization remains challenging. In this work, we provide a training-dynamics analysis showing that SFT can be interpreted as a special case of policy gradient optimization with an extremely sparse implicit reward and unstable inverse-probability weighting, which together lead to single-path dependency, entropy collapse, and gradient explosion. Motivated by this diagnosis, we propose Group Fine-Tuning (GFT), a unified post-training framework that addresses these intrinsic limitations through two mechanisms: Group Advantage Learning, which constructs diverse response groups and derives normalized contrastive supervision to alleviate reward sparsity, and Dynamic Coefficient Rectification, which adaptively bounds inverse-probability weights to stabilize optimization while preserving efficient knowledge injection. Experiments demonstrate that GFT consistently surpasses SFT-based methods and yields policies that integrate more smoothly with subsequent RL training.
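The abstract's central diagnosis, that SFT is a policy gradient with a sparse implicit reward and unstable inverse-probability weighting, can be made precise by a standard rewriting of the SFT gradient on a demonstration $y^\star$ (this is a sketch consistent with the abstract's claims, not the paper's own derivation):

```latex
\nabla_\theta \mathcal{L}_{\mathrm{SFT}}
  = -\nabla_\theta \log \pi_\theta(y^\star \mid x)
  = -\,\mathbb{E}_{y \sim \pi_\theta}\!\left[
      \frac{\mathbb{1}[y = y^\star]}{\pi_\theta(y \mid x)}\,
      \nabla_\theta \log \pi_\theta(y \mid x)
    \right].
```

Read as a policy gradient, the implicit reward $r(y) = \mathbb{1}[y = y^\star]/\pi_\theta(y \mid x)$ is nonzero only on the single demonstration (reward sparsity and single-path dependency) and diverges as $\pi_\theta(y^\star \mid x) \to 0$ (the unstable inverse-probability weighting behind gradient explosion).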
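The two mechanisms can be sketched in a minimal executable form, under stated assumptions: Group Advantage Learning is approximated by GRPO-style per-group reward normalization, and Dynamic Coefficient Rectification by a simple clamp on the inverse-probability weight. The function names and the threshold `c_max` are hypothetical illustrations, not the paper's actual formulation:

```python
import math

def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one response group (GRPO-style; assumed).

    Turns raw rewards into zero-mean, unit-scale contrastive signals so
    supervision is no longer a single sparse target.
    """
    g = len(rewards)
    mean = sum(rewards) / g
    var = sum((r - mean) ** 2 for r in rewards) / g  # population variance
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]

def rectified_coeff(logprob, c_max=5.0):
    """Bound the inverse-probability weight 1/pi(y|x).

    Without the clamp, exp(-logprob) diverges as pi -> 0, reproducing the
    gradient-explosion failure mode of plain SFT. c_max is a hypothetical
    choice of threshold.
    """
    return min(math.exp(-logprob), c_max)

def gft_surrogate(logprobs, rewards, c_max=5.0):
    """Policy-gradient-style surrogate over one group of responses:
    higher-advantage responses are reinforced, lower ones suppressed."""
    advs = group_advantages(rewards)
    return -sum(rectified_coeff(lp, c_max) * a * lp
                for lp, a in zip(logprobs, advs)) / len(rewards)
```

For example, with two sampled responses at log-probabilities -1.0 and -2.0 and rewards 1.0 and 0.0, the second response's raw inverse-probability weight exp(2) ≈ 7.39 is clamped to 5.0 before entering the surrogate.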