Geometric-Mean Policy Optimization

July 28, 2025
Authors: Yuzhong Zhao, Yue Liu, Junpeng Liu, Jingye Chen, Xun Wu, Yaru Hao, Tengchao Lv, Shaohan Huang, Lei Cui, Qixiang Ye, Fang Wan, Furu Wei
cs.AI

Abstract

Recent advancements, such as Group Relative Policy Optimization (GRPO), have enhanced the reasoning capabilities of large language models by optimizing the arithmetic mean of token-level rewards. However, GRPO suffers from unstable policy updates when processing tokens with outlier importance-weighted rewards, which manifests as extreme importance sampling ratios during training, i.e., the ratio between the sampling probabilities assigned to a token by the current and old policies. In this work, we propose Geometric-Mean Policy Optimization (GMPO), a stabilized variant of GRPO. Instead of optimizing the arithmetic mean, GMPO maximizes the geometric mean of token-level rewards, which is inherently less sensitive to outliers and maintains a more stable range of importance sampling ratios. In addition, we provide comprehensive theoretical and experimental analysis to justify the design and stability benefits of GMPO. Beyond improved stability, GMPO-7B outperforms GRPO by an average of 4.1% on multiple mathematical benchmarks and by 1.4% on a multimodal reasoning benchmark; the evaluations cover AIME24, AMC, MATH500, OlympiadBench, Minerva, and Geometry3K. Code is available at https://github.com/callsys/GMPO.
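To make the outlier-sensitivity claim concrete, below is a minimal numerical sketch, not the authors' implementation (the actual GMPO objective, including clipping and advantage weighting, is defined in the paper and the linked repository). It only compares how an arithmetic mean (GRPO-style aggregation) and a geometric mean (GMPO-style aggregation) of token-level importance sampling ratios respond when a single token's ratio is an outlier:

```python
import numpy as np

def arithmetic_mean(ratios):
    # GRPO-style aggregation: plain average of token-level ratios.
    return float(np.mean(ratios))

def geometric_mean(ratios):
    # GMPO-style aggregation: geometric mean, computed in log space
    # for numerical stability.
    return float(np.exp(np.mean(np.log(ratios))))

# 100 well-behaved tokens whose current/old policy ratios are near 1 ...
ratios = np.full(100, 1.0)
# ... plus one outlier token whose sampling probability shifted sharply.
ratios[0] = 50.0

print(arithmetic_mean(ratios))  # ~1.49: pulled far from 1 by a single token
print(geometric_mean(ratios))   # ~1.04: barely affected by the outlier
```

The geometric mean dampens the influence of a single extreme ratio, which is the intuition behind the stability benefit the abstract reports.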