Geometric-Mean Policy Optimization

Abstract

Recent advances such as Group Relative Policy Optimization (GRPO) have enhanced the reasoning capabilities of large language models by optimizing the arithmetic mean of token-level rewards. However, GRPO suffers from unstable policy updates when processing tokens with outlier importance-weighted rewards. This instability manifests during training as extreme importance sampling ratios, i.e., extreme ratios between the sampling probabilities assigned to a token by the current and old policies. In this work, we propose Geometric-Mean Policy Optimization (GMPO), a stabilized variant of GRPO. Instead of optimizing the arithmetic mean, GMPO maximizes the geometric mean of token-level rewards, which is inherently less sensitive to outliers and keeps importance sampling ratios within a more stable range. In addition, we provide comprehensive theoretical and experimental analysis to justify the design and stability benefits of GMPO. Beyond improved stability, GMPO-7B outperforms GRPO by an average of 4.1% on multiple mathematical benchmarks and by 1.4% on a multimodal reasoning benchmark; the benchmarks include AIME24, AMC, MATH500, OlympiadBench, Minerva, and Geometry3K. Code is available at https://github.com/callsys/GMPO.
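The claim that a geometric mean is less sensitive to outliers than an arithmetic mean can be illustrated with a small numerical sketch. This is not the GMPO objective or the authors' implementation. The ratio values and the helper names `arithmetic_mean` and `geometric_mean` are assumptions made up for illustration. The sketch only compares how far each mean moves when one token in a sequence has an extreme value.

```python
import math


def arithmetic_mean(values):
    """Plain average of the values."""
    return sum(values) / len(values)


def geometric_mean(values):
    """Geometric mean of strictly positive values, computed in log space."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical per-token values for one response (illustrative numbers only).
ratios = [1.0, 1.0, 1.0, 1.0]
# Same sequence, but the last token is an outlier.
ratios_with_outlier = [1.0, 1.0, 1.0, 16.0]

# How far each mean moves when the single outlier is introduced.
am_shift = arithmetic_mean(ratios_with_outlier) - arithmetic_mean(ratios)
gm_shift = geometric_mean(ratios_with_outlier) - geometric_mean(ratios)

print(f"arithmetic mean shift: {am_shift:.3f}")
print(f"geometric mean shift:  {gm_shift:.3f}")
```

With these made-up numbers, the arithmetic mean moves from 1.0 to 4.75, while the geometric mean moves only to about 2.0 (the fourth root of 16). A single extreme token therefore shifts the geometric mean much less, which is the property the abstract appeals to.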