Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR

May 19, 2026
Authors: Utkarsh Tyagi, Xingang Guo, MohammadHossein Rezaei, Daniel George, Anas Mahmoud, Jackson Lee, Bing Liu, Yunzhong He
cs.AI

Abstract

Reinforcement learning with verifiable rewards has made post-training highly effective when correctness can be checked automatically. However, many important model behaviors require satisfying several qualitative criteria at once. Rubric-based rewards address this setting by grading prompt-specific criteria and aggregating them into a scalar reward. Yet standard static aggregations conflate a criterion's human-assigned importance with its current usefulness as an optimization signal. We show that this assumption breaks down in rubric RL: many important criteria are already saturated or currently unreachable, while criteria that distinguish rollouts are not necessarily those with the largest human weights. We introduce POW3R, a policy-aware rubric reward framework that preserves human weights and category balance as the rubric objective while adapting criterion-level reward weights during training. POW3R uses rollout-level contrast to emphasize criteria that currently separate the policy's outputs, making the GRPO reward more informative without changing the underlying evaluation target. Across three base policies on two datasets spanning multimodal and text-only settings, POW3R wins 24 of 30 base-policy/metric comparisons, improving both mean rubric reward and strict completion (the fraction of prompts whose response satisfies every required rubric criterion) over vanilla GRPO with rubric rewards, and reaches the same plateau in 2.5 to 4 times fewer training steps. Rubric rewards should therefore distinguish what should matter in the final answer from what can teach the current policy.
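
The contrast between static aggregation and policy-aware reweighting can be illustrated with a toy sketch. The abstract does not give POW3R's actual weighting rule, so everything below is an assumption made for illustration: the function names `rubric_rewards` and `contrast_weights`, the use of across-rollout variance as the "contrast" signal, and the `mix` blending parameter are all hypothetical and should not be read as the authors' method.

```python
from statistics import pvariance


def rubric_rewards(scores, weights):
    """Static aggregation: weighted average of criterion scores per rollout."""
    total = sum(weights)
    return [sum(w * s for w, s in zip(weights, row)) / total for row in scores]


def contrast_weights(scores, weights, mix=0.5):
    """Hypothetical policy-aware reweighting (NOT the paper's formula).

    Blends normalized human weights with a variant that is rescaled by how
    much each criterion varies across the rollouts of one prompt. Criteria
    that every rollout passes (saturated) or fails (unreachable) have zero
    variance and therefore receive no extra emphasis.
    """
    total = sum(weights)
    human = [w / total for w in weights]
    n_criteria = len(weights)
    variances = [pvariance([row[j] for row in scores]) for j in range(n_criteria)]
    raw = [h * v for h, v in zip(human, variances)]
    z = sum(raw)
    if z == 0:
        # No criterion separates the rollouts: fall back to human weights.
        return human
    contrast = [r / z for r in raw]
    return [(1 - mix) * h + mix * c for h, c in zip(human, contrast)]


# Four rollouts for one prompt, three binary criteria (illustrative data).
# Criterion 0: heavily weighted, always satisfied (saturated).
# Criterion 1: heavily weighted, never satisfied (currently unreachable).
# Criterion 2: lightly weighted, satisfied by only some rollouts.
scores = [[1, 0, 1], [1, 0, 0], [1, 0, 1], [1, 0, 0]]
human_weights = [5, 4, 1]

static = rubric_rewards(scores, human_weights)
adapted = contrast_weights(scores, human_weights)
adaptive = rubric_rewards(scores, adapted)

print("static rewards:  ", static)
print("adapted weights: ", adapted)
print("adaptive rewards:", adaptive)
```

In this toy case the only criterion that separates the rollouts carries the smallest human weight, so the static rewards differ very little between rollouts, while the reweighted rewards differ more. The human weights remain part of the blended weights, loosely mirroring the abstract's point that the evaluation target is preserved while the training signal is adapted.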