Geometric-Mean Policy Optimization

July 28, 2025
Authors: Yuzhong Zhao, Yue Liu, Junpeng Liu, Jingye Chen, Xun Wu, Yaru Hao, Tengchao Lv, Shaohan Huang, Lei Cui, Qixiang Ye, Fang Wan, Furu Wei
cs.AI

Abstract

Recent advancements, such as Group Relative Policy Optimization (GRPO), have enhanced the reasoning capabilities of large language models by optimizing the arithmetic mean of token-level rewards. However, GRPO suffers from unstable policy updates when processing tokens with outlier importance-weighted rewards, which manifests as extreme importance sampling ratios during training, i.e., the ratio between the sampling probabilities assigned to a token by the current and old policies. In this work, we propose Geometric-Mean Policy Optimization (GMPO), a stabilized variant of GRPO. Instead of optimizing the arithmetic mean, GMPO maximizes the geometric mean of token-level rewards, which is inherently less sensitive to outliers and maintains a more stable range of importance sampling ratios. In addition, we provide comprehensive theoretical and experimental analysis to justify the design and stability benefits of GMPO. Beyond improved stability, GMPO-7B outperforms GRPO by an average of 4.1% on multiple mathematical benchmarks and by 1.4% on a multimodal reasoning benchmark; the evaluations cover AIME24, AMC, MATH500, OlympiadBench, Minerva, and Geometry3K. Code is available at https://github.com/callsys/GMPO.
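To make the outlier-sensitivity claim concrete, below is a minimal numerical sketch, not the authors' implementation (the actual GMPO objective, including clipping and advantage weighting, is defined in the paper and the linked repository). It only compares how an arithmetic mean (GRPO-style aggregation) and a geometric mean (GMPO-style aggregation) of token-level importance sampling ratios respond when a single token's ratio is an outlier:

```python
import numpy as np

def arithmetic_mean(ratios):
    # GRPO-style aggregation: plain average of token-level ratios.
    return float(np.mean(ratios))

def geometric_mean(ratios):
    # GMPO-style aggregation: geometric mean, computed in log space
    # for numerical stability.
    return float(np.exp(np.mean(np.log(ratios))))

# 100 well-behaved tokens whose current/old policy ratios are near 1 ...
ratios = np.full(100, 1.0)
# ... plus one outlier token whose sampling probability shifted sharply.
ratios[0] = 50.0

print(arithmetic_mean(ratios))  # ~1.49: pulled far from 1 by a single token
print(geometric_mean(ratios))   # ~1.04: barely affected by the outlier
```

The geometric mean dampens the influence of a single extreme ratio, which is the intuition behind the stability benefit the abstract reports.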