Towards a Unified View of Large Language Model Post-Training

Abstract

Post-training of modern language models draws on two major sources of training data: online data (model-generated rollouts) and offline data (demonstrations from humans or other models). These two types of data are typically used by Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT), respectively. In this paper, we show that these approaches are not contradictory but are instances of a single optimization process. We derive a Unified Policy Gradient Estimator and express the computations of a wide spectrum of post-training approaches as the gradient of a common objective under different data distribution assumptions and bias-variance tradeoffs. The gradient estimator consists of four interchangeable parts: a stabilization mask, a reference policy denominator, an advantage estimate, and a likelihood gradient. Motivated by these theoretical findings, we propose Hybrid Post-Training (HPT), an algorithm that dynamically selects among different training signals. HPT is designed to achieve both effective exploitation of demonstrations and stable exploration without sacrificing learned reasoning patterns. We provide extensive experiments and ablation studies to verify the effectiveness of the unified theoretical framework and of HPT. Across six mathematical reasoning benchmarks and two out-of-distribution suites, HPT consistently surpasses strong baselines on models of varying scales and families.
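The abstract names the four parts of the estimator but does not give its formula. As a minimal sketch, assuming the parts combine multiplicatively as mask × advantage / reference probability × likelihood gradient, the toy example below evaluates that product for a single-step softmax policy. All function names, the softmax setup, and the ratio-based mask are hypothetical illustrations, not the paper's implementation. Under this assumption, choosing the current policy as the reference with a unit advantage and no masking gives the familiar SFT (log-likelihood) gradient.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def unified_gradient(logits, action, advantage, ref_prob, mask=1.0):
    """Assumed form: mask * advantage / ref_prob * d pi_theta(action) / d logits."""
    probs = softmax(logits)
    p_a = probs[action]
    # Likelihood gradient of a softmax policy:
    # d pi(a) / d logit_k = pi(a) * (1[k == a] - pi(k))
    likelihood_grad = [
        p_a * ((1.0 if k == action else 0.0) - probs[k])
        for k in range(len(logits))
    ]
    scale = mask * advantage / ref_prob
    return [scale * g for g in likelihood_grad]


def sft_like_gradient(logits, demo_action):
    """Special case: reference = current policy, advantage = 1, no masking."""
    p_a = softmax(logits)[demo_action]
    return unified_gradient(logits, demo_action, advantage=1.0, ref_prob=p_a)


def ratio_mask(new_prob, old_prob, eps=0.2):
    """Simplified stabilization mask (assumption): keep the sample only if the
    probability ratio stays inside [1 - eps, 1 + eps]."""
    ratio = new_prob / old_prob
    return 1.0 if (1.0 - eps) <= ratio <= (1.0 + eps) else 0.0
```

In this sketch, an off-policy RL-style update would pass the rollout policy's probability as `ref_prob`, a reward-derived `advantage`, and `ratio_mask(...)` as `mask`. Real clipping rules also depend on the sign of the advantage, which is omitted here for brevity.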
