Geometric-Mean Policy Optimization

Abstract

Recent advancements such as Group Relative Policy Optimization (GRPO) have enhanced the reasoning capabilities of large language models by optimizing the arithmetic mean of token-level rewards. However, GRPO suffers from unstable policy updates when it processes tokens with outlier importance-weighted rewards. This instability manifests during training as extreme importance sampling ratios, where the importance sampling ratio is the ratio between the sampling probabilities assigned to a token by the current and old policies. In this work, we propose Geometric-Mean Policy Optimization (GMPO), a stabilized variant of GRPO. Instead of optimizing the arithmetic mean, GMPO maximizes the geometric mean of token-level rewards, which is inherently less sensitive to outliers and keeps importance sampling ratios within a more stable range. In addition, we provide comprehensive theoretical and experimental analysis to justify the design and stability benefits of GMPO. Beyond improved stability, GMPO-7B outperforms GRPO by an average of 4.1% on multiple mathematical benchmarks and by 1.4% on a multimodal reasoning benchmark. The benchmarks include AIME24, AMC, MATH500, OlympiadBench, Minerva, and Geometry3K. Code is available at https://github.com/callsys/GMPO.
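
The claim that a geometric mean is less sensitive to outliers than an arithmetic mean can be checked with a small numerical sketch. The example below is illustrative only and is not the GMPO objective: it omits advantages and clipping, and the per-token ratios and the helper names `arithmetic_mean` and `geometric_mean` are assumptions made for this example. It compares how one extreme importance sampling ratio shifts each kind of mean.

```python
import math


def arithmetic_mean(values):
    """Plain average of the values."""
    return sum(values) / len(values)


def geometric_mean(values):
    """Geometric mean, computed in log space; all values must be positive."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical per-token importance sampling ratios for one response.
# Three tokens are unchanged (ratio 1.0) and one token is an outlier.
ratios = [1.0, 1.0, 1.0, 16.0]

am = arithmetic_mean(ratios)  # (1 + 1 + 1 + 16) / 4 = 4.75
gm = geometric_mean(ratios)   # 16 ** (1 / 4), approximately 2.0

print(f"arithmetic mean: {am:.2f}")
print(f"geometric mean:  {gm:.2f}")
```

In this toy case the single outlier pulls the arithmetic mean to 4.75, while the geometric mean stays near 2.0. This is the general property the abstract relies on, not a reproduction of the paper's results.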