A Unified Framework for Rethinking Policy Divergence Measures in GRPO
February 5, 2026
Authors: Qingyuan Wu, Yuhui Wang, Simon Sinong Zhan, Yanning Dai, Shilong Deng, Sarra Habchi, Qi Zhu, Matthias Gallé, Chao Huang
cs.AI
Abstract
Reinforcement Learning with Verified Reward (RLVR) has emerged as a critical paradigm for advancing the reasoning capabilities of Large Language Models (LLMs). Most existing RLVR methods, such as GRPO and its variants, ensure stable updates by constraining policy divergence through clipping likelihood ratios. This paper introduces a unified clipping framework that characterizes existing methods via a general notion of policy divergence, encompassing both likelihood ratios and Kullback-Leibler (KL) divergences and extending to alternative measures. The framework provides a principled foundation for systematically analyzing how different policy divergence measures affect exploration and performance. We further identify the KL3 estimator, a variance-reduced Monte Carlo estimator of the KL divergence, as a key policy divergence constraint. We prove that the KL3-based constraint is mathematically equivalent to an asymmetric ratio-based clipping that reallocates probability mass toward high-confidence actions, promoting stronger exploration while retaining the simplicity of GRPO-style methods. Empirical results on mathematical reasoning benchmarks show that incorporating the KL3 estimator into GRPO improves both training stability and final performance, highlighting the importance of principled policy divergence constraints in policy optimization.
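The KL3 estimator referenced in the abstract is presumably the standard k3 Monte Carlo estimator of the KL divergence, k3 = r - 1 - log r, where r is the ratio of new to old policy probabilities; evaluated on samples drawn from the old policy, this estimate is non-negative and unbiased for KL(pi_old || pi_new). The PyTorch sketch below illustrates, under that assumption, how such a term could serve as the divergence constraint in a GRPO-style objective in place of symmetric ratio clipping. The function names (k3_estimate, grpo_kl3_loss) and the penalty weight beta are illustrative and not taken from the paper.

```python
import torch

def k3_estimate(logp_new: torch.Tensor, logp_old: torch.Tensor) -> torch.Tensor:
    """Per-token k3 Monte Carlo estimate of KL(pi_old || pi_new):
    r - 1 - log r with r = pi_new / pi_old, for samples drawn from pi_old.
    Non-negative and typically lower-variance than the naive -log r estimator."""
    log_ratio = logp_new - logp_old
    return torch.exp(log_ratio) - 1.0 - log_ratio

def grpo_kl3_loss(logp_new, logp_old, rewards, beta: float = 0.1):
    """Hypothetical GRPO-style loss in which a k3 divergence penalty replaces
    symmetric ratio clipping.
    logp_new, logp_old: (group_size, seq_len) per-token log-probabilities
    rewards:            (group_size,) scalar verified rewards for one prompt
    """
    # Group-relative advantage: standardize rewards within the sampled group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    adv = adv.unsqueeze(-1)                       # broadcast over tokens

    ratio = torch.exp(logp_new - logp_old)        # importance weight per token
    surrogate = ratio * adv                       # unclipped policy objective
    divergence = k3_estimate(logp_new, logp_old)  # per-token KL3 constraint

    # Maximize the surrogate while penalizing divergence from the old policy.
    return -(surrogate - beta * divergence).mean()

# Toy usage: a group of 4 sampled completions, 16 tokens each.
logp_old = -torch.rand(4, 16)                     # stand-in per-token log-probs
logp_new = logp_old + 0.05 * torch.randn(4, 16)
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])      # verified 0/1 rewards
loss = grpo_kl3_loss(logp_new, logp_old, rewards)
```

Note that the k3 penalty is asymmetric in the ratio: for instance, k3(2) ≈ 0.31 while k3(0.5) ≈ 0.19, so upward deviations of the likelihood ratio are penalized more heavily than downward deviations of the same factor, which is consistent with the abstract's characterization of the KL3 constraint as an asymmetric ratio-based clipping.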