Your Group-Relative Advantage Is Biased
January 13, 2026
Authors: Fengkai Yang, Zherui Chen, Xiaohan Wang, Xiaodong Lu, Jiajun Chai, Guojun Yin, Wei Lin, Shuai Ma, Fuzhen Zhuang, Deqing Wang, Yaodong Yang, Jianxin Li, Yikun Ban
cs.AI
Abstract
Reinforcement Learning from Verifier Rewards (RLVR) has emerged as a widely used approach for post-training large language models on reasoning tasks, with group-based methods such as GRPO and its variants gaining broad adoption. These methods rely on group-relative advantage estimation to avoid learned critics, yet its theoretical properties remain poorly understood.
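For reference, the estimator in question is the standard GRPO form: each of the G sampled responses to a prompt is scored by the verifier, and the advantage of response i is its group-normalized reward

    \hat{A}_i = \frac{r_i - \mathrm{mean}(r_1, \ldots, r_G)}{\mathrm{std}(r_1, \ldots, r_G)}.

With binary verifier rewards and empirical group success rate \hat{p}, this reduces to \sqrt{(1-\hat{p})/\hat{p}} for a correct response and -\sqrt{\hat{p}/(1-\hat{p})} for an incorrect one, so the assigned advantage is a function of within-group sample statistics alone rather than of the true expected advantage.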
In this work, we uncover a fundamental issue of group-based RL: the group-relative advantage estimator is inherently biased relative to the true (expected) advantage. We provide the first theoretical analysis showing that it systematically underestimates advantages for hard prompts and overestimates them for easy prompts, leading to imbalanced exploration and exploitation. To address this issue, we propose History-Aware Adaptive Difficulty Weighting (HA-DW), an adaptive reweighting scheme that adjusts advantage estimates based on an evolving difficulty anchor and training dynamics. Both theoretical analysis and experiments on five mathematical reasoning benchmarks demonstrate that HA-DW consistently improves performance when integrated into GRPO and its variants. Our results suggest that correcting biased advantage estimation is critical for robust and efficient RLVR training.
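The abstract does not spell out HA-DW's weighting rule, so the following is only a minimal, hypothetical sketch of what a history-aware scheme of this kind might look like, assuming binary verifier rewards and an exponential moving average (EMA) of per-prompt success rate as the evolving difficulty anchor. The class name, the EMA anchor, and the square-root weighting are illustrative assumptions, not the authors' actual method.

    import numpy as np

    class HistoryAwareReweighter:
        """Hypothetical history-aware difficulty reweighting (illustrative only).

        Keeps a per-prompt EMA of the empirical success rate as an evolving
        difficulty anchor, then rescales GRPO-style group-relative advantages
        so that historically hard prompts are upweighted and easy prompts
        are downweighted.
        """

        def __init__(self, ema_decay: float = 0.9, eps: float = 1e-6):
            self.ema_decay = ema_decay
            self.eps = eps
            self.anchor = {}  # prompt_id -> EMA of success rate

        def __call__(self, prompt_id, rewards: np.ndarray) -> np.ndarray:
            # Standard group-relative (GRPO-style) advantage.
            adv = (rewards - rewards.mean()) / (rewards.std() + self.eps)

            # Update the evolving difficulty anchor for this prompt
            # (assumes binary rewards, so the mean is a success rate).
            p = float(rewards.mean())
            prev = self.anchor.get(prompt_id, p)
            ema = self.ema_decay * prev + (1.0 - self.ema_decay) * p
            self.anchor[prompt_id] = ema

            # One plausible weighting: amplify advantages on hard prompts
            # (low anchor), damp them on easy ones (high anchor). The real
            # HA-DW rule is not given in the abstract.
            weight = np.sqrt((1.0 - ema + self.eps) / (ema + self.eps))
            return weight * adv

Used per prompt at each update, e.g.:

    reweighter = HistoryAwareReweighter()
    rewards = np.array([1.0, 0.0, 0.0, 0.0])  # 1 of 4 rollouts verified correct
    advantages = reweighter("prompt_42", rewards)

The design choice here simply mirrors the bias direction identified above: counteract underestimation on hard prompts and overestimation on easy ones.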