C2: Scalable Rubric-Augmented Reward Modeling from Binary Preferences

Abstract

Rubric-augmented verification guides reward models with explicit evaluation criteria, yielding more reliable judgments than single-model verification. However, most existing methods require costly rubric annotations, which limits scalability. Moreover, we find that rubric generation is vulnerable to a failure of cooperation: low-quality rubrics actively mislead reward models rather than help them. Inspired by the principle of cooperative communication, we propose Cooperative yet Critical reward modeling (C2), a framework that significantly improves reward model judgments by having the reward model critically collaborate with a rubric generator trained solely from binary preferences. In C2, we synthesize pairs of helpful and misleading rubrics by measuring whether each rubric shifts the reward model toward or away from the correct preference. Using these contrastive pairs, we train a cooperative rubric generator to propose helpful rubrics and a critical verifier to assess rubric validity before making its judgment. At inference time, the verifier follows only the rubrics it deems helpful. C2 outperforms reasoning reward models trained on the same binary preferences, with gains of up to 6.5 points on RM-Bench and 6.0 points in length-controlled win rate on AlpacaEval 2.0. Without external rubric annotations, C2 enables an 8B reward model to match the performance achieved with rubrics from a 4× larger model. Overall, our work demonstrates that eliciting deliberate cooperation in rubric-augmented verification makes reward models more trustworthy in a scalable way.
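The contrastive-pair synthesis described in the abstract can be illustrated with a minimal sketch. It assumes a hypothetical scoring function that returns the reward model's probability of preferring the correct response, with or without a rubric. The names `PreferenceScorer`, `measure_shifts`, `contrastive_pairs`, and the `margin` threshold are illustrative assumptions. The abstract does not specify how shifts are measured or how pairs are formed, so this should not be read as the authors' implementation.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, List, Optional, Tuple

# Hypothetical interface (an assumption, not from the paper): returns the reward
# model's probability that the preferred ("chosen") response wins, optionally
# conditioned on a rubric. Passing None means "judge without any rubric".
PreferenceScorer = Callable[[str, str, str, Optional[str]], float]


@dataclass
class RubricShift:
    rubric: str
    shift: float  # > 0: rubric moves the judgment toward the correct preference


def measure_shifts(
    score: PreferenceScorer,
    prompt: str,
    chosen: str,
    rejected: str,
    rubrics: List[str],
) -> List[RubricShift]:
    """Compare each rubric-conditioned score against the rubric-free baseline."""
    baseline = score(prompt, chosen, rejected, None)
    return [
        RubricShift(r, score(prompt, chosen, rejected, r) - baseline)
        for r in rubrics
    ]


def contrastive_pairs(
    shifts: List[RubricShift], margin: float = 0.0
) -> List[Tuple[str, str]]:
    """Pair every helpful rubric with every misleading one.

    `margin` is an assumed threshold: rubrics whose shift lies within
    [-margin, margin] are treated as neutral and dropped.
    """
    helpful = [s.rubric for s in shifts if s.shift > margin]
    misleading = [s.rubric for s in shifts if s.shift < -margin]
    return list(product(helpful, misleading))
```

Under these assumptions, each resulting `(helpful, misleading)` pair could serve as a preference example for the rubric generator, and the sign of each shift could serve as a label for the verifier. How the training data is actually constructed is not stated in the abstract.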

