RUBRIC-ARROW: Alternating Pointwise Rubric Reward Modeling for LLM Post-Training in Non-Verifiable Domains

Abstract

Pointwise reward modeling provides critical signals for LLM post-training, yet it struggles to assign absolute scores in subjective, non-verifiable settings. Rubric-based methods address this by decomposing evaluation into explicit criteria, but existing approaches typically depend on frontier LLMs and suffer from ties caused by hard Boolean aggregation. We present RUBRIC-ARROW, an alternating framework that jointly trains a rubric generator and a rubric-conditioned judge, with its RL stage using only pairwise preference data. The method couples a probability-based scoring rule, which reduces ties, with phase-specific preference-based rewards and an alternating GRPO scheme; together these train the pointwise evaluator. Extensive experiments show that RUBRIC-ARROW achieves competitive reward-modeling accuracy and yields consistent gains for downstream policy post-training.
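
To make the tie problem concrete, the following is a minimal illustrative sketch, not the scoring rule used by RUBRIC-ARROW, whose exact form the abstract does not specify. It assumes a judge that emits, for each rubric criterion, a probability that the response satisfies it. The function names, the 0.5 threshold, the mean aggregation, and the example probabilities are all hypothetical.

```python
from typing import Sequence


def hard_boolean_score(probs: Sequence[float], threshold: float = 0.5) -> float:
    """Threshold each criterion to pass/fail, then return the pass fraction.

    Assumption: this stands in for the 'hard Boolean aggregation' mentioned
    in the abstract; the threshold value is hypothetical.
    """
    return sum(p >= threshold for p in probs) / len(probs)


def probability_score(probs: Sequence[float]) -> float:
    """Average the per-criterion probabilities without thresholding.

    Assumption: a plain mean is used purely for illustration; the paper's
    probability-based rule may differ.
    """
    return sum(probs) / len(probs)


def compare(score_a: float, score_b: float) -> str:
    """Turn two pointwise scores into a pairwise outcome."""
    if score_a > score_b:
        return "a"
    if score_b > score_a:
        return "b"
    return "tie"


# Hypothetical per-criterion probabilities for two responses to one prompt.
probs_a = [0.9, 0.8, 0.3]
probs_b = [0.6, 0.55, 0.4]

hard_outcome = compare(hard_boolean_score(probs_a), hard_boolean_score(probs_b))
soft_outcome = compare(probability_score(probs_a), probability_score(probs_b))
print(hard_outcome, soft_outcome)
```

Under these assumed inputs both responses pass two of three criteria after thresholding, so the Boolean scores coincide and the pair cannot be ranked, whereas the probability-based scores keep the confidence information and separate the two responses.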