Outcome Accuracy is Not Enough: Aligning the Reasoning Process of Reward Models
February 4, 2026
Authors: Binghai Wang, Yantao Liu, Yuxuan Liu, Tianyi Tang, Shenzhi Wang, Chang Gao, Chujie Zheng, Yichang Zhang, Le Yu, Shixuan Liu, Tao Gui, Qi Zhang, Xuanjing Huang, Bowen Yu, Fei Huang, Junyang Lin
cs.AI
Abstract
Generative Reward Models (GenRMs) and LLM-as-a-Judge systems exhibit deceptive alignment, producing correct judgments for incorrect reasons, because they are trained and evaluated to prioritize Outcome Accuracy, which undermines their ability to generalize during RLHF. We introduce Rationale Consistency, a fine-grained metric that quantifies how well a model's reasoning process aligns with human judgment. Our evaluation of frontier models shows that rationale consistency effectively discriminates among state-of-the-art models and detects deceptive alignment, whereas outcome accuracy falls short on both counts. To bridge this gap, we introduce a hybrid signal that combines rationale consistency with outcome accuracy for GenRM training. Our training method achieves state-of-the-art performance on RM-Bench (87.1%) and JudgeBench (82%), surpassing outcome-only baselines by an average of 5%. When the resulting reward model is used during RLHF, it delivers clear gains on Arena Hard v2, most notably a 7% improvement on creative writing tasks. Further analysis confirms that our method escapes the deceptive alignment trap, reversing the decline in rationale consistency observed with outcome-only training.
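The abstract names a hybrid training signal but not its exact form. As a minimal illustrative sketch only (not the paper's implementation), the snippet below blends a binary outcome-correctness check with a rationale-consistency score via an assumed linear weighting; the function name, scoring interface, and weight `alpha` are all hypothetical.

```python
# Minimal sketch (assumed formulation, not the paper's method): combine an
# outcome-correctness check with a rationale-consistency score into one
# hybrid training reward for a GenRM.

def hybrid_reward(judgment_correct: bool,
                  rationale_score: float,
                  alpha: float = 0.5) -> float:
    """Blend outcome accuracy (0/1) with rationale consistency in [0, 1].

    judgment_correct: whether the GenRM picked the preferred response.
    rationale_score: how well the model's stated reasoning matches a
        human-annotated rationale (e.g. as scored by a grader model).
    alpha: interpolation weight between the two signals (assumed value).
    """
    outcome = 1.0 if judgment_correct else 0.0
    return alpha * outcome + (1.0 - alpha) * rationale_score


# A judgment that is correct but poorly reasoned earns less reward than
# one that is correct for the right reasons.
print(hybrid_reward(True, 0.2))  # 0.6  -> correct outcome, weak rationale
print(hybrid_reward(True, 0.9))  # 0.95 -> correct outcome, faithful rationale
```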