Geometric-Mean Policy Optimization
July 28, 2025
Authors: Yuzhong Zhao, Yue Liu, Junpeng Liu, Jingye Chen, Xun Wu, Yaru Hao, Tengchao Lv, Shaohan Huang, Lei Cui, Qixiang Ye, Fang Wan, Furu Wei
cs.AI
Abstract
Recent advancements, such as Group Relative Policy Optimization (GRPO), have
enhanced the reasoning capabilities of large language models by optimizing the
arithmetic mean of token-level rewards. However, GRPO suffers from unstable
policy updates when processing tokens with outlier importance-weighted rewards,
which manifests as extreme importance sampling ratios during training, i.e.,
the ratio between the sampling probabilities assigned to a token by the current
and old policies. In this work, we propose Geometric-Mean Policy Optimization
(GMPO), a stabilized variant of GRPO. Instead of optimizing the arithmetic
mean, GMPO maximizes the geometric mean of token-level rewards, which is
inherently less sensitive to outliers and maintains a more stable range of
importance sampling ratios. In addition, we provide comprehensive theoretical
and experimental analysis to justify the design and stability benefits of GMPO.
Beyond improved stability, GMPO-7B outperforms GRPO by an average of 4.1% on
multiple mathematical benchmarks and 1.4% on a multimodal reasoning benchmark,
including AIME24, AMC, MATH500, OlympiadBench, Minerva, and Geometry3K. Code is
available at https://github.com/callsys/GMPO.
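To make the contrast concrete, the following is a minimal sketch of the two surrogate objectives. The GRPO form is the standard clipped, group-normalized objective; for GMPO the sketch only indicates the replacement of the arithmetic mean over tokens with a geometric mean, since the exact clipping bounds and the handling of negative advantages are specified in the paper rather than here. The notation (r_t(\theta) for the token-level importance sampling ratio, \hat{A} for the group-relative advantage, |o| for the response length) is assumed for illustration.

% Token-level importance sampling ratio (notation assumed for illustration):
%   r_t(\theta) = \pi_\theta(o_t \mid q, o_{<t}) / \pi_{\theta_{\mathrm{old}}}(o_t \mid q, o_{<t})

% GRPO: arithmetic mean of clipped, importance-weighted token-level rewards
\mathcal{J}_{\mathrm{GRPO}}(\theta) =
  \mathbb{E}\!\left[ \frac{1}{|o|} \sum_{t=1}^{|o|}
    \min\!\Big( r_t(\theta)\,\hat{A},\;
                \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A} \Big) \right]

% GMPO (sketch): the same token-level terms aggregated by a geometric mean,
% which damps the influence of outlier importance sampling ratios; the sign
% handling for negative advantages and the exact clipping bounds used by GMPO
% are omitted here
\mathcal{J}_{\mathrm{GMPO}}(\theta) \approx
  \mathbb{E}\!\left[ \Big( \prod_{t=1}^{|o|}
    \min\!\Big( r_t(\theta)\,\hat{A},\;
                \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A} \Big) \Big)^{1/|o|} \right]

One practical consequence of the geometric-mean form is that, for positive terms, it equals the exponential of the mean of their logarithms, so it can be optimized in log space and a single token with an extreme importance sampling ratio shifts the objective far less than it would under an arithmetic mean.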