
C2: Scalable Rubric-Augmented Reward Modeling from Binary Preferences

April 15, 2026
Authors: Akira Kawabata, Saku Sugawara
cs.AI

Abstract

Rubric-augmented verification guides reward models with explicit evaluation criteria, yielding more reliable judgments than single-model verification. However, most existing methods require costly rubric annotations, limiting scalability. Moreover, we find that rubric generation is vulnerable to a failure of cooperation: low-quality rubrics actively mislead reward models rather than help them. Inspired by the principle of cooperative communication, we propose Cooperative yet Critical reward modeling (C2), a framework that significantly improves reward model judgments by having the reward model critically collaborate with a rubric generator trained solely on binary preferences. In C2, we synthesize helpful and misleading rubric pairs by measuring how each rubric shifts the reward model toward or away from the correct preference. Using these contrastive pairs, we train a cooperative rubric generator to propose helpful rubrics, and a critical verifier to assess rubric validity before making its judgment, so that at inference time it follows only the rubrics it deems helpful. C2 outperforms reasoning reward models trained on the same binary preferences, with gains of up to 6.5 points on RM-Bench and 6.0 points in length-controlled win rate on AlpacaEval 2.0. Without external rubric annotations, C2 enables an 8B reward model to match the performance achieved with rubrics from a 4× larger model. Overall, our work demonstrates that eliciting deliberate cooperation in rubric-augmented verification makes reward models more trustworthy in a scalable way.
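To make the abstract's data-synthesis step concrete, here is a minimal Python sketch of labeling a rubric as helpful or misleading by whether conditioning the judge on it shifts the preference margin toward the known-correct (chosen) response. The scoring interface and all names (`ScoreFn`, `label_rubric`) are illustrative assumptions, not the paper's actual implementation.

```python
from typing import Callable, Optional

# Hypothetical judge interface: score(prompt, response, rubric) -> scalar reward.
# Passing rubric=None means an unguided (rubric-free) judgment.
ScoreFn = Callable[[str, str, Optional[str]], float]


def label_rubric(score: ScoreFn, prompt: str, chosen: str,
                 rejected: str, rubric: str) -> str:
    """Label a rubric by its effect on the judged preference margin."""
    # Preference margin without rubric guidance: how strongly the judge
    # already prefers the chosen response over the rejected one.
    base = score(prompt, chosen, None) - score(prompt, rejected, None)
    # Preference margin when the same judgment is conditioned on the rubric.
    guided = score(prompt, chosen, rubric) - score(prompt, rejected, rubric)
    # A helpful rubric moves the judge toward the correct preference;
    # a misleading one moves it away (or flips it).
    return "helpful" if guided > base else "misleading"
```

Under this reading, rubrics labeled helpful and misleading for the same prompt form the contrastive pairs used to train the cooperative rubric generator and the critical verifier.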