
Towards a Unified View of Large Language Model Post-Training

September 4, 2025
作者: Xingtai Lv, Yuxin Zuo, Youbang Sun, Hongyi Liu, Yuntian Wei, Zhekai Chen, Lixuan He, Xuekai Zhu, Kaiyan Zhang, Bingning Wang, Ning Ding, Bowen Zhou
cs.AI

Abstract

Two major sources of training data exist for post-training modern language models: online data (model-generated rollouts) and offline data (human or other-model demonstrations). These two types of data are typically used by approaches like Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT), respectively. In this paper, we show that these approaches are not in contradiction, but are instances of a single optimization process. We derive a Unified Policy Gradient Estimator, and present the calculations of a wide spectrum of post-training approaches as the gradient of a common objective under different data distribution assumptions and various bias-variance tradeoffs. The gradient estimator is constructed with four interchangeable parts: stabilization mask, reference policy denominator, advantage estimate, and likelihood gradient. Motivated by our theoretical findings, we propose Hybrid Post-Training (HPT), an algorithm that dynamically selects different training signals. HPT is designed to yield both effective exploitation of demonstrations and stable exploration without sacrificing learned reasoning patterns. We provide extensive experiments and ablation studies to verify the effectiveness of our unified theoretical framework and HPT. Across six mathematical reasoning benchmarks and two out-of-distribution suites, HPT consistently surpasses strong baselines across models of varying scales and families.
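The abstract's two ideas can be sketched schematically. A minimal sketch, assuming illustrative names and a simple mean-reward switching rule (the paper's actual estimator and HPT selection criterion are not given in this abstract): the per-token gradient weight is built from the four interchangeable parts named above, and HPT picks between an RL signal (when the model's own rollouts succeed) and an SFT signal on demonstrations (when they do not).

```python
def unified_pg_weight(mask, advantage, pi_theta, pi_ref):
    """Schematic per-token coefficient for the Unified Policy Gradient Estimator.

    In practice this coefficient multiplies the likelihood gradient
    grad log pi_theta; the four interchangeable parts from the abstract are:
      mask      -- stabilization mask (0/1), drops unstable tokens
      pi_ref    -- reference-policy denominator (importance-style ratio)
      advantage -- advantage estimate (e.g. reward minus baseline for RL;
                   a constant for plain SFT on demonstrations)
      (the likelihood gradient itself is supplied by the training framework)
    """
    return mask * advantage * pi_theta / pi_ref


def hpt_signal(rollout_rewards, threshold=0.5):
    """Schematic Hybrid Post-Training switch (hypothetical rule):
    if the model's own rollouts already succeed often enough, keep
    exploring with RL; otherwise exploit demonstrations with SFT."""
    mean_reward = sum(rollout_rewards) / len(rollout_rewards)
    return "rl" if mean_reward >= threshold else "sft"
```

For example, a prompt whose rollouts score `[1, 1, 0, 1]` would be trained with the RL signal, while one scoring `[0, 0, 0, 1]` would fall back to SFT on demonstrations; setting the mask to 0 zeroes out a token's contribution entirely.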