RUBRIC-ARROW: Alternating Pointwise Rubric Reward Modeling for LLM Post-training in Non-verifiable Domains
May 27, 2026
Authors: Haoxiang Jiang, Zihan Dong, Tianci Liu, Wanying Wang, Ran Xu, Tony Yu, Linjun Zhang, Haoyu Wang
cs.AI
Abstract
Pointwise reward modeling offers critical signals for LLM post-training, yet struggles with absolute scoring in subjective, non-verifiable settings. Rubric-based methods address this by decomposing evaluation into explicit criteria, but existing approaches typically depend on frontier LLMs and suffer from ties caused by hard Boolean aggregation. We present RUBRIC-ARROW, an alternating framework that jointly trains a rubric generator and a rubric-conditioned judge, with its RL stage using only pairwise preference data. Our method couples a probability-based scoring rule that reduces ties with phase-specific preference-based rewards and an alternating GRPO scheme that together train the pointwise evaluator. Extensive experiments show that RUBRIC-ARROW achieves competitive reward-modeling accuracy and yields consistent gains for downstream policy post-training.
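To make the tie problem concrete, the following is a minimal sketch, not the authors' implementation. It assumes a judge that returns, for each rubric criterion, a probability that the response satisfies it. The function names, the 0.5 threshold, the mean aggregation, and the example probabilities are all hypothetical choices made for illustration; the abstract does not specify the exact scoring rule or reward.

```python
from typing import Sequence


def hard_boolean_score(probs: Sequence[float], threshold: float = 0.5) -> float:
    """Fraction of criteria counted as satisfied after thresholding (assumed rule)."""
    return sum(p >= threshold for p in probs) / len(probs)


def probability_score(probs: Sequence[float]) -> float:
    """Mean per-criterion probability, with no thresholding (assumed rule)."""
    return sum(probs) / len(probs)


def preference_reward(score_chosen: float, score_rejected: float) -> float:
    """1.0 if the pointwise scores rank the preferred response strictly higher."""
    return 1.0 if score_chosen > score_rejected else 0.0


# Hypothetical judge outputs for three rubric criteria on a preference pair.
chosen_probs = [0.9, 0.8, 0.6]
rejected_probs = [0.7, 0.6, 0.55]

# Hard Boolean aggregation: every criterion clears the threshold for both
# responses, so the two scores are identical and the pair is a tie.
hard_chosen = hard_boolean_score(chosen_probs)
hard_rejected = hard_boolean_score(rejected_probs)

# Probability-based aggregation keeps the margin between the two responses.
soft_chosen = probability_score(chosen_probs)
soft_rejected = probability_score(rejected_probs)

hard_reward = preference_reward(hard_chosen, hard_rejected)
soft_reward = preference_reward(soft_chosen, soft_rejected)
```

In this toy pair, thresholding discards the difference between the two responses, so a pairwise preference label gives no usable signal. Keeping the probabilities preserves the ordering, which is the kind of effect the abstract attributes to a probability-based scoring rule.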