Outcome Accuracy is Not Enough: Aligning the Reasoning Process of Reward Models

Abstract

Generative Reward Models (GenRMs) and LLM-as-a-Judge systems exhibit deceptive alignment: they produce correct judgments for incorrect reasons. Because they are trained and evaluated to prioritize Outcome Accuracy, their ability to generalize during RLHF is undermined. We introduce Rationale Consistency, a fine-grained metric that quantifies the alignment between the model's reasoning process and human judgment. Our evaluation of frontier models reveals that rationale consistency effectively discriminates among state-of-the-art models and detects deceptive alignment, while outcome accuracy falls short in both respects. To close this gap, we introduce a hybrid signal that combines rationale consistency with outcome accuracy for GenRM training. Our training method achieves state-of-the-art performance on RM-Bench (87.1%) and JudgeBench (82%), surpassing outcome-only baselines by an average of 5%. When the resulting reward model is used during RLHF, our method improves performance on Arena Hard v2, notably yielding a 7% improvement in creative writing tasks. Further analysis confirms that our method escapes the deceptive alignment trap, effectively reversing the decline in rationale consistency observed in outcome-only training.
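
The distinction between the two signals can be illustrated with a toy sketch. The abstract does not specify how rationale consistency is computed or how the hybrid signal is weighted, so everything below is an assumption made for illustration: the `Judgment` schema, the overlap-based consistency score, and the `weight` parameter are hypothetical and are not the authors' definitions.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Judgment:
    """One pairwise verdict from a GenRM (hypothetical schema, not from the paper)."""
    predicted_winner: str      # "A" or "B", chosen by the model
    human_winner: str          # "A" or "B", from the human label
    model_reasons: set[str]    # reasons cited in the model's rationale
    human_reasons: set[str]    # reasons cited by the human annotator


def outcome_accuracy(j: Judgment) -> float:
    """1.0 if the model picked the same winner as the human, else 0.0."""
    return 1.0 if j.predicted_winner == j.human_winner else 0.0


def rationale_consistency(j: Judgment) -> float:
    """ASSUMPTION: fraction of human reasons that the model's rationale also cites."""
    if not j.human_reasons:
        return 0.0
    return len(j.model_reasons & j.human_reasons) / len(j.human_reasons)


def hybrid_signal(j: Judgment, weight: float = 0.5) -> float:
    """ASSUMPTION: a convex combination of the two scores; the weight is illustrative."""
    return weight * rationale_consistency(j) + (1.0 - weight) * outcome_accuracy(j)


human = {"correct final result", "valid derivation"}

# Correct verdict, but justified by a reason the human never used.
right_for_wrong_reason = Judgment("A", "A", {"longer answer"}, human)

# Correct verdict, justified by the same reasons as the human.
right_for_right_reason = Judgment("A", "A", set(human), human)

for name, j in [("wrong reason", right_for_wrong_reason),
                ("right reason", right_for_right_reason)]:
    print(name, outcome_accuracy(j), rationale_consistency(j), hybrid_signal(j))
```

Under these assumptions both judgments receive the same outcome accuracy, so an outcome-only signal cannot tell them apart, while the consistency term and the hybrid score separate them.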

