Outcome Accuracy is Not Enough: Aligning the Reasoning Process of Reward Models

February 4, 2026
Authors: Binghai Wang, Yantao Liu, Yuxuan Liu, Tianyi Tang, Shenzhi Wang, Chang Gao, Chujie Zheng, Yichang Zhang, Le Yu, Shixuan Liu, Tao Gui, Qi Zhang, Xuanjing Huang, Bowen Yu, Fei Huang, Junyang Lin
cs.AI

Abstract

Generative Reward Models (GenRMs) and LLM-as-a-Judge systems exhibit deceptive alignment: they produce correct judgments for incorrect reasons because they are trained and evaluated to prioritize outcome accuracy, which undermines their ability to generalize during RLHF. We introduce Rationale Consistency, a fine-grained metric that quantifies the alignment between a model's reasoning process and human judgment. Our evaluation of frontier models reveals that rationale consistency effectively discriminates among state-of-the-art models and detects deceptive alignment, whereas outcome accuracy falls short in both respects. To close this gap, we introduce a hybrid training signal for GenRMs that combines rationale consistency with outcome accuracy. Our training method achieves state-of-the-art performance on RM-Bench (87.1%) and JudgeBench (82%), surpassing outcome-only baselines by an average of 5%. When the resulting reward model is used during RLHF, our method delivers clear gains on Arena Hard v2, most notably a 7% improvement on creative-writing tasks. Further analysis confirms that our method escapes the deceptive-alignment trap, reversing the decline in rationale consistency observed under outcome-only training.
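As a rough illustration of what a hybrid signal of this kind could look like, the sketch below linearly blends a binary outcome-correctness signal with a graded rationale-consistency score. The abstract does not specify the paper's actual formulation, so the function name, the linear weighting, and the parameter `alpha` are assumptions for illustration only.

```python
def hybrid_reward(outcome_correct: bool,
                  rationale_consistency: float,
                  alpha: float = 0.5) -> float:
    """Blend a binary outcome signal with a rationale-consistency score in [0, 1].

    Hypothetical sketch: the paper combines Rationale Consistency with
    Outcome Accuracy, but its exact mixing rule is not given in the
    abstract; this linear interpolation is an assumption.
    """
    outcome_score = 1.0 if outcome_correct else 0.0
    return alpha * outcome_score + (1.0 - alpha) * rationale_consistency


# A judgment that is correct but weakly aligned with human reasoning
# scores lower than one that is correct for the right reasons.
print(hybrid_reward(outcome_correct=True, rationale_consistency=0.2))  # 0.6
print(hybrid_reward(outcome_correct=True, rationale_consistency=0.9))  # 0.95
```

Under such a scheme, a GenRM cannot maximize reward through deceptive alignment alone: a correct verdict reached via flawed reasoning is penalized relative to one whose rationale matches human judgment.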