Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR

May 19, 2026
Authors: Utkarsh Tyagi, Xingang Guo, MohammadHossein Rezaei, Daniel George, Anas Mahmoud, Jackson Lee, Bing Liu, Yunzhong He
cs.AI

Abstract

Reinforcement learning with verifiable rewards has made post-training highly effective when correctness can be checked automatically. However, many important model behaviors require satisfying several qualitative criteria at once. Rubric-based rewards address this setting by grading prompt-specific criteria and aggregating them into a scalar reward. Yet standard static aggregations conflate a criterion's human-assigned importance with its current usefulness as an optimization signal. We show that this assumption breaks down in rubric RL: many important criteria are already saturated or currently unreachable, while criteria that distinguish rollouts are not necessarily those with the largest human weights. We introduce POW3R, a policy-aware rubric reward framework that preserves human weights and category balance as the rubric objective while adapting criterion-level reward weights during training. POW3R uses rollout-level contrast to emphasize criteria that currently separate the policy's outputs, making the GRPO reward more informative without changing the underlying evaluation target. Across three base policies on two datasets spanning multimodal and text-only settings, POW3R wins 24 of 30 base-policy/metric comparisons, improving both mean rubric reward and strict completion (the fraction of prompts whose response satisfies every required rubric criterion) over vanilla GRPO with rubric rewards, and reaches the same plateau in 2.5× to 4× fewer training steps. Rubric rewards should therefore distinguish what should matter in the final answer from what can teach the current policy.
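
The gap between a criterion's human weight and its usefulness as a training signal can be illustrated with a small sketch. The abstract does not give POW3R's actual weighting rule, so everything below is an assumption made for illustration: the function names (`static_reward`, `group_advantages`, `contrast_weights`), the toy grades, and in particular the choice of pass-rate variance p(1 - p) as the contrast measure are hypothetical. The sketch only shows why a criterion that every rollout passes (saturated) or every rollout fails (unreachable) cannot separate rollouts within a group, however large its human weight.

```python
from statistics import mean, pstdev


def static_reward(passes, weights):
    """Weighted fraction of rubric criteria satisfied by one rollout."""
    total = sum(weights)
    if total == 0:
        return 0.0  # no usable weight mass; treat the reward as zero
    return sum(w * p for w, p in zip(weights, passes)) / total


def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one prompt's rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def contrast_weights(grades, weights):
    """Hypothetical reweighting (NOT the paper's rule): scale each human weight
    by the criterion's pass-rate variance p * (1 - p) across the rollout group."""
    n = len(grades)
    scaled = []
    for j, w in enumerate(weights):
        p = sum(row[j] for row in grades) / n
        scaled.append(w * p * (1 - p))
    return scaled


# Toy data: 4 rollouts x 3 criteria (1 = satisfied).
# Criterion 0 is saturated, criterion 1 is unreachable, criterion 2 discriminates.
grades = [
    [1, 0, 1],
    [1, 0, 0],
    [1, 0, 1],
    [1, 0, 0],
]
human_weights = [5.0, 3.0, 1.0]  # the discriminating criterion has the smallest weight

static = [static_reward(row, human_weights) for row in grades]
adaptive_w = contrast_weights(grades, human_weights)  # [0.0, 0.0, 0.25]
adaptive = [static_reward(row, adaptive_w) for row in grades]

print("static rewards:   ", static)
print("adaptive weights: ", adaptive_w)
print("adaptive rewards: ", adaptive)
print("static advantages:", group_advantages(static))
```

In this toy group the two heavily weighted criteria add the same constant to every rollout's reward, so the within-group differences in the static reward come entirely from the lowest-weighted criterion. The hypothetical reweighting makes that explicit by assigning zero training weight to the criteria that do not vary, while `human_weights` remains available unchanged as the evaluation target.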