Your Group-Relative Advantage Is Biased
January 13, 2026
Authors: Fengkai Yang, Zherui Chen, Xiaohan Wang, Xiaodong Lu, Jiajun Chai, Guojun Yin, Wei Lin, Shuai Ma, Fuzhen Zhuang, Deqing Wang, Yaodong Yang, Jianxin Li, Yikun Ban
cs.AI
Abstract
Reinforcement Learning from Verifier Rewards (RLVR) has emerged as a widely used approach for post-training large language models on reasoning tasks, with group-based methods such as GRPO and its variants gaining broad adoption. These methods rely on group-relative advantage estimation to avoid learned critics, yet the estimator's theoretical properties remain poorly understood.
In this work, we uncover a fundamental issue of group-based RL: the group-relative advantage estimator is inherently biased relative to the true (expected) advantage. We provide the first theoretical analysis showing that it systematically underestimates advantages for hard prompts and overestimates them for easy prompts, leading to imbalanced exploration and exploitation. To address this issue, we propose History-Aware Adaptive Difficulty Weighting (HA-DW), an adaptive reweighting scheme that adjusts advantage estimates based on an evolving difficulty anchor and training dynamics. Both theoretical analysis and experiments on five mathematical reasoning benchmarks demonstrate that HA-DW consistently improves performance when integrated into GRPO and its variants. Our results suggest that correcting biased advantage estimation is critical for robust and efficient RLVR training.
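To make the objects of the analysis concrete, here is a minimal sketch of the group-relative advantage estimator that GRPO-style methods use in place of a learned critic: each sampled response's verifier reward is centered by its group's mean and scaled by the group's standard deviation. The `ema_anchor_weight` function below is purely illustrative of the *kind* of mechanism the abstract describes for HA-DW (an evolving difficulty anchor that reweights advantages); its form, and the names `beta` and `tau`, are assumptions, not the paper's actual update rule.

```python
import numpy as np

def group_relative_advantage(rewards, eps=1e-8):
    """Group-relative advantage as used in GRPO: center each rollout's
    reward by the group mean and scale by the group std deviation."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def ema_anchor_weight(success_rate, anchor, beta=0.9, tau=0.5):
    """Hypothetical difficulty reweighting in the spirit of HA-DW
    (NOT the paper's rule): update an exponential-moving-average
    difficulty anchor, then up-weight prompts whose empirical success
    rate falls below the anchor and down-weight those above it."""
    new_anchor = beta * anchor + (1.0 - beta) * success_rate
    weight = np.exp((new_anchor - success_rate) / tau)  # > 1 for harder-than-anchor prompts
    return weight, new_anchor

# Binary verifier rewards for a group of 4 rollouts on one hard prompt
# (only 1 of 4 responses verified correct):
rewards = [1.0, 0.0, 0.0, 0.0]
adv = group_relative_advantage(rewards)
w, anchor = ema_anchor_weight(success_rate=0.25, anchor=0.6)
reweighted = w * adv  # advantages scaled by the illustrative difficulty weight
```

By construction the group-relative advantages sum to zero within each group, so the single correct rollout gets a positive advantage and the failures negative ones; the hypothetical weight `w` exceeds 1 here because the prompt's success rate (0.25) is below the running anchor.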