Not All Rubrics Teach Equally: Policy-Aware Rubric Rewards for RLVR

Abstract

Reinforcement learning with verifiable rewards (RLVR) has made post-training highly effective when correctness can be checked automatically. However, many important model behaviors require satisfying several qualitative criteria at once. Rubric-based rewards address this setting by grading prompt-specific criteria and aggregating them into a scalar reward. Yet standard static aggregations treat a criterion's human-assigned importance as if it were also its current usefulness as an optimization signal. We show that this assumption breaks down in rubric RL: many important criteria are already saturated or currently unreachable, while the criteria that distinguish rollouts are not necessarily those with the largest human weights. We introduce POW3R, a policy-aware rubric reward framework that preserves human weights and category balance as the rubric objective while adapting criterion-level reward weights during training. POW3R uses rollout-level contrast to emphasize criteria that currently separate the policy's outputs, making the GRPO reward more informative without changing the underlying evaluation target. Across three base policies on two datasets spanning multimodal and text-only settings, POW3R wins 24 of 30 base-policy/metric comparisons. It improves both mean rubric reward and strict completion (the fraction of prompts whose response satisfies every required rubric criterion) over vanilla GRPO with rubric rewards, and it reaches the same plateau in 2.5 to 4 times fewer training steps. Rubric rewards should therefore distinguish what should matter in the final answer from what can teach the current policy.
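To make the contrast between static aggregation and policy-aware weighting concrete, the following is a minimal sketch under stated assumptions. The abstract does not give POW3R's actual weighting rule, so the rule used here (scaling each human weight by the variance of that criterion's scores across a rollout group) is a hypothetical stand-in chosen only for illustration. It also omits the category balance that POW3R is said to preserve. All function and variable names are invented for this example.

```python
def static_reward(scores, weights):
    """Static aggregation: weighted mean of per-criterion scores."""
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


def contrast_weights(rollout_scores, human_weights):
    """Hypothetical policy-aware reweighting (an assumption, not POW3R itself).

    Each human weight is scaled by the population variance of that
    criterion's scores across the rollout group, then renormalized.
    A criterion that every rollout passes (saturated) or every rollout
    fails (unreachable) has zero variance and so receives zero weight.
    """
    n_rollouts = len(rollout_scores)
    raw = []
    for j, w in enumerate(human_weights):
        column = [r[j] for r in rollout_scores]
        mean = sum(column) / n_rollouts
        variance = sum((x - mean) ** 2 for x in column) / n_rollouts
        raw.append(w * variance)
    total = sum(raw)
    if total == 0:
        # No criterion separates the rollouts: fall back to human weights.
        h_total = sum(human_weights)
        return [w / h_total for w in human_weights]
    return [x / total for x in raw]


# Four rollouts for one prompt, graded on three binary criteria.
# Criterion 0: human weight 5, passed by every rollout (saturated).
# Criterion 1: human weight 3, failed by every rollout (unreachable).
# Criterion 2: human weight 1, passed by half of the rollouts.
rollouts = [[1, 0, 1], [1, 0, 0], [1, 0, 1], [1, 0, 0]]
human = [5, 3, 1]

static = [static_reward(r, human) for r in rollouts]
adaptive_w = contrast_weights(rollouts, human)
adaptive = [static_reward(r, adaptive_w) for r in rollouts]
```

In this toy group, the two criteria with the largest human weights carry no information about which rollout is better, so the sketch shifts all training weight to the third criterion. The human weights themselves are left untouched and would still define the evaluation target.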