
C2: Scalable Rubric-Augmented Reward Modeling from Binary Preferences

April 15, 2026
作者: Akira Kawabata, Saku Sugawara
cs.AI

Abstract

Rubric-augmented verification guides reward models with explicit evaluation criteria, yielding more reliable judgments than single-model verification. However, most existing methods require costly rubric annotations, limiting scalability. Moreover, we find that rubric generation is vulnerable to a failure of cooperation: low-quality rubrics actively mislead reward models rather than help them. Inspired by the principle of cooperative communication, we propose Cooperative yet Critical reward modeling (C2), a framework that significantly improves reward model judgments by having the reward model critically collaborate with a rubric generator trained solely on binary preferences. In C2, we synthesize helpful and misleading rubric pairs by measuring how each rubric shifts the reward model toward or away from the correct preference. Using these contrastive pairs, we train a cooperative rubric generator to propose helpful rubrics, and a critical verifier to assess rubric validity before making its judgment, following only the rubrics it deems helpful at inference time. C2 outperforms reasoning reward models trained on the same binary preferences, with gains of up to 6.5 points on RM-Bench and 6.0 points in length-controlled win rate on AlpacaEval 2.0. Without external rubric annotations, C2 enables an 8B reward model to match the performance achieved with rubrics from a 4× larger model. Overall, our work demonstrates that eliciting deliberate cooperation in rubric-augmented verification makes reward models more trustworthy in a scalable way.
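The contrastive rubric synthesis described above can be sketched in a few lines. This is a minimal illustrative sketch, not the authors' implementation: `reward_fn`, `preference_margin`, and `label_rubrics` are hypothetical names, and the assumed interface is a scalar reward function that optionally conditions on a rubric.

```python
# Hypothetical sketch of C2-style contrastive rubric synthesis.
# Assumption: reward_fn(prompt, response, rubric=None) -> float is a
# scalar reward model that can optionally condition on a rubric string.

def preference_margin(reward_fn, prompt, chosen, rejected, rubric=None):
    """Margin by which the reward model prefers `chosen` over `rejected`."""
    return (reward_fn(prompt, chosen, rubric)
            - reward_fn(prompt, rejected, rubric))

def label_rubrics(reward_fn, prompt, chosen, rejected, rubrics):
    """Split candidate rubrics into helpful vs. misleading by how much each
    shifts the preference margin relative to the rubric-free baseline,
    then pair them up as contrastive training examples."""
    base = preference_margin(reward_fn, prompt, chosen, rejected)
    helpful, misleading = [], []
    for r in rubrics:
        shift = preference_margin(reward_fn, prompt, chosen, rejected, r) - base
        (helpful if shift > 0 else misleading).append((shift, r))
    # Pair the most helpful rubric with the most misleading one, and so on.
    helpful.sort(reverse=True)
    misleading.sort()
    return list(zip([r for _, r in helpful], [r for _, r in misleading]))
```

In the paper's pipeline, pairs like these would supervise the cooperative rubric generator (toward the helpful side) and the critical verifier (to reject the misleading side) before inference.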
PDF · April 18, 2026