Hölder Policy Optimisation

Abstract

Group Relative Policy Optimisation (GRPO) enhances large language models by estimating advantages across a group of sampled trajectories. However, mapping these trajectory-level advantages to policy updates requires aggregating token-level probabilities within each sequence. Relying on a fixed aggregation mechanism for this step fundamentally limits the algorithm's adaptability. Empirically, we observe a critical trade-off: some fixed aggregations frequently suffer training collapse, while others fail to reach satisfactory performance. To resolve this, we propose HölderPO, a generalised policy optimisation framework that unifies token-level probability aggregation via the Hölder mean. By explicitly modulating the parameter p, our framework provides continuous control over the trade-off between gradient concentration and variance bounds. Theoretically, we prove that a larger p concentrates the gradient to amplify sparse learning signals, whereas a smaller p strictly bounds gradient variance. Because no static configuration can universally resolve this concentration-stability trade-off, we instantiate the framework with a dynamic annealing algorithm that progressively schedules p across the training lifecycle. Extensive evaluations demonstrate superior stability and convergence over existing baselines. Specifically, our approach achieves a state-of-the-art average accuracy of 54.9% across multiple mathematical benchmarks, a 7.2% relative gain over standard GRPO, and a 93.8% success rate on ALFWorld.
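
To make the role of p concrete, the following is a minimal sketch of Hölder (power) mean aggregation over per-token values, with an annealed p. It is not the authors' implementation. The abstract gives neither the exact quantity being aggregated nor the schedule, so several things here are assumptions for illustration only: the uniform token weights, the function names `holder_mean` and `linear_p_schedule`, the linear form of the schedule, and its hypothetical endpoints `p_start` and `p_end`.

```python
import math
from typing import Sequence


def holder_mean(values: Sequence[float], p: float, eps: float = 1e-8) -> float:
    """Power (Hölder) mean of strictly positive values with uniform weights.

    M_p(x) = ((1/n) * sum_i x_i**p) ** (1/p), with the p -> 0 limit equal to
    the geometric mean. Uniform weighting is an assumption of this sketch.
    """
    if not values:
        raise ValueError("values must be non-empty")
    if any(v <= 0.0 for v in values):
        raise ValueError("values must be strictly positive")
    n = len(values)
    if abs(p) < eps:
        # Limit p -> 0: geometric mean.
        return math.exp(sum(math.log(v) for v in values) / n)
    # Evaluate in log space (log-sum-exp) so large |p| does not overflow.
    scaled_logs = [p * math.log(v) for v in values]
    m = max(scaled_logs)
    log_mean = m + math.log(sum(math.exp(s - m) for s in scaled_logs) / n)
    return math.exp(log_mean / p)


def linear_p_schedule(step: int, total_steps: int,
                      p_start: float, p_end: float) -> float:
    """Hypothetical linear annealing of p from p_start to p_end.

    The abstract only says p is scheduled progressively; the linear shape
    and both endpoints are placeholders, not values from the paper.
    """
    frac = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return p_start + frac * (p_end - p_start)


if __name__ == "__main__":
    token_values = [0.9, 0.5, 0.1]  # made-up per-token values
    for p in (-1.0, 0.0, 1.0, 8.0):
        print(f"p={p:+.1f}  M_p={holder_mean(token_values, p):.4f}")
    print(linear_p_schedule(step=50, total_steps=100, p_start=1.0, p_end=0.0))
```

With uniform weights, p = 1 gives the arithmetic mean, p → 0 the geometric mean, and p = -1 the harmonic mean. The power mean is non-decreasing in p, so a larger p lets the largest per-token values dominate the aggregate, while a smaller p pulls it towards the smallest ones.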