
Unveiling Implicit Advantage Symmetry: Why GRPO Struggles with Exploration and Difficulty Adaptation

February 5, 2026
Authors: Zhiqi Yu, Zhangquan Chen, Mengting Liu, Heye Zhang, Liangqiong Qu
cs.AI

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR), particularly GRPO, has become the standard paradigm for eliciting LLM reasoning. However, its efficiency in exploration and difficulty adaptation remains an open challenge. In this work, we argue that these bottlenecks stem from an implicit advantage symmetry inherent in Group Relative Advantage Estimation (GRAE). This symmetry induces two critical limitations: (i) at the group level, strict symmetry in weights between correct and incorrect trajectories leaves the logits of unsampled actions unchanged, thereby hindering exploration of novel correct solutions; (ii) at the sample level, the algorithm implicitly prioritizes medium-difficulty samples, remaining agnostic to the non-stationary demands of difficulty focus. Through controlled experiments, we reveal that this symmetric property is sub-optimal, yielding two pivotal insights: (i) asymmetrically suppressing the advantages of correct trajectories encourages essential exploration; (ii) learning efficiency is maximized by a curriculum-like transition that prioritizes simpler samples initially before gradually shifting to complex ones. Motivated by these findings, we propose Asymmetric GRAE (A-GRAE), which dynamically modulates exploration incentives and sample-difficulty focus. Experiments across seven benchmarks demonstrate that A-GRAE consistently improves GRPO and its variants across both LLMs and MLLMs.
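For intuition, below is a minimal sketch of the group-relative advantage normalization used in GRPO and the symmetry the abstract describes, plus a purely illustrative asymmetric rescaling of positive advantages. The function names, the factor `alpha`, and the down-weighting scheme are assumptions for illustration only and are not the paper's A-GRAE, whose details are not given in the abstract.

```python
import numpy as np

def grae_advantages(rewards):
    """Group-relative advantage estimation (GRPO-style): normalize each
    trajectory's reward by the group mean and standard deviation.
    With binary verifiable rewards, correct and incorrect trajectories
    receive advantages of equal magnitude and opposite sign -- the
    implicit symmetry discussed in the abstract."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def asymmetric_advantages(rewards, alpha=0.7):
    """Hypothetical asymmetric variant (illustration, not the paper's
    A-GRAE): scale down the positive advantages of correct trajectories
    so the update pushes less probability mass onto already-sampled
    correct solutions, leaving more room for exploration."""
    adv = grae_advantages(rewards)
    adv[adv > 0] *= alpha
    return adv

# Example: a group of 4 sampled trajectories, 2 correct and 2 incorrect.
group_rewards = [1, 1, 0, 0]
print(grae_advantages(group_rewards))        # symmetric: ~[+1, +1, -1, -1]
print(asymmetric_advantages(group_rewards))  # asymmetric: ~[+0.7, +0.7, -1, -1]
```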