RUBRIC-ARROW: Alternating Pointwise Rubric Reward Modeling for LLM Post-Training in Non-Verifiable Domains

Abstract

Pointwise reward modeling provides a critical signal for LLM post-training, yet it struggles with absolute scoring in subjective, non-verifiable settings. Rubric-based methods address this by decomposing evaluation into explicit criteria, but existing approaches typically depend on frontier LLMs and suffer from ties caused by hard Boolean aggregation. We present RUBRIC-ARROW, an alternating framework that jointly trains a rubric generator and a rubric-conditioned judge, using only pairwise preference data in its RL stage. The method combines a probability-based scoring rule that reduces ties, phase-specific preference-based rewards, and an alternating GRPO scheme, which together train the pointwise evaluator. Extensive experiments show that RUBRIC-ARROW achieves competitive reward-modeling accuracy and yields consistent gains in downstream policy post-training.
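
To make the tie issue concrete, the following minimal sketch contrasts hard Boolean aggregation with a soft, probability-based aggregate on toy data. It is an illustration under stated assumptions, not the scoring rule used by RUBRIC-ARROW: the per-criterion probabilities, the 0.5 threshold, the mean aggregate, and the helper names (boolean_score, probability_score, tie_rate) are all hypothetical.

```python
from itertools import combinations

def boolean_score(probs, threshold=0.5):
    """Hard aggregation: count the criteria whose probability clears the threshold."""
    return sum(p >= threshold for p in probs)

def probability_score(probs):
    """Soft aggregation (assumed form): mean probability that a criterion is satisfied."""
    return sum(probs) / len(probs)

def tie_rate(scores):
    """Fraction of response pairs that receive exactly the same score."""
    pairs = list(combinations(scores, 2))
    return sum(a == b for a, b in pairs) / len(pairs)

# Hypothetical judge outputs: P(criterion satisfied) for three rubric
# criteria, for each of four candidate responses to the same prompt.
responses = [
    [0.90, 0.80, 0.60],
    [0.70, 0.60, 0.55],
    [0.95, 0.40, 0.30],
    [0.60, 0.45, 0.20],
]

hard = [boolean_score(r) for r in responses]
soft = [probability_score(r) for r in responses]

print("hard scores:", hard, "tie rate:", tie_rate(hard))
print("soft scores:", soft, "tie rate:", tie_rate(soft))
```

In this toy setting, thresholding collapses the first two responses to one score and the last two to another, so pairwise preferences within each group carry no signal, whereas the soft aggregate separates all four responses.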