RUBRIC-ARROW: Alternating Pointwise Rubric Reward Modeling for LLM Post-Training in Non-Verifiable Domains

Abstract

Pointwise reward modeling offers critical signals for LLM post-training, yet struggles with absolute scoring in subjective, non-verifiable settings. Rubric-based methods address this by decomposing evaluation into explicit criteria, but existing approaches typically depend on frontier LLMs and suffer from ties caused by hard Boolean aggregation. We present RUBRIC-ARROW, an alternating framework that jointly trains a rubric generator and a rubric-conditioned judge, with its RL stage using only pairwise preference data. Our method couples a probability-based scoring rule that reduces ties with phase-specific preference-based rewards and an alternating GRPO scheme that together train the pointwise evaluator. Extensive experiments show that RUBRIC-ARROW achieves competitive reward-modeling accuracy and yields consistent gains for downstream policy post-training.
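To make the tie problem concrete, the following is a minimal sketch and not the paper's actual scoring rule, which the abstract does not specify. It assumes a judge that returns, for each rubric criterion, a probability that the response satisfies it. The function names (`boolean_score`, `probability_score`, `preference_reward`), the 0.5 threshold, the unweighted mean, and the example probabilities are all hypothetical choices made for illustration.

```python
from typing import Sequence


def boolean_score(p_yes: Sequence[float], threshold: float = 0.5) -> float:
    """Hard Boolean aggregation (assumed form): fraction of criteria
    whose thresholded verdict is 'satisfied'."""
    return sum(p >= threshold for p in p_yes) / len(p_yes)


def probability_score(p_yes: Sequence[float]) -> float:
    """Probability-based aggregation (assumed form): mean of the judge's
    per-criterion satisfaction probabilities, without thresholding."""
    return sum(p_yes) / len(p_yes)


def preference_reward(score_chosen: float, score_rejected: float) -> float:
    """Hypothetical pairwise reward: 1.0 only when the pointwise scores
    rank the preferred response strictly higher. A tie earns nothing."""
    return 1.0 if score_chosen > score_rejected else 0.0


# Illustrative per-criterion probabilities for a preferred and a
# dispreferred response under the same three-criterion rubric.
chosen = [0.9, 0.8, 0.3]
rejected = [0.6, 0.55, 0.4]

# Both responses pass the same number of criteria after thresholding,
# so the hard scores tie and the preference pair yields no reward.
hard_reward = preference_reward(boolean_score(chosen), boolean_score(rejected))

# The soft scores keep the confidence margins and separate the pair.
soft_reward = preference_reward(
    probability_score(chosen), probability_score(rejected)
)

print(hard_reward, soft_reward)
```

Under these assumptions, both responses pass two of the three criteria, so the thresholded scores are identical and the pair provides no training signal, while the probability-based scores still order the responses. This is one way a pointwise evaluator could be trained from pairwise preference data alone.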