C2: Scalable Rubric-Augmented Reward Modeling from Binary Preferences

Abstract

Rubric-augmented verification guides reward models with explicit evaluation criteria, yielding more reliable judgments than single-model verification. However, most existing methods require costly rubric annotations, which limits scalability. Moreover, we find that rubric generation is vulnerable to a failure of cooperation: low-quality rubrics actively mislead reward models rather than help them. Inspired by the principle of cooperative communication, we propose Cooperative yet Critical reward modeling (C2), a framework that significantly improves reward model judgments by having the reward model critically collaborate with a rubric generator trained solely from binary preferences. In C2, we synthesize helpful and misleading rubric pairs by measuring how each rubric shifts the reward model toward or away from the correct preference. Using these contrastive pairs, we train a cooperative rubric generator to propose helpful rubrics and a critical verifier to assess rubric validity before making its judgment, so that at inference time it follows only the rubrics it deems helpful. C2 outperforms reasoning reward models trained on the same binary preferences, with gains of up to 6.5 points on RM-Bench and 6.0 points in length-controlled win rate on AlpacaEval 2.0. Without external rubric annotations, C2 enables an 8B reward model to match the performance achieved with rubrics from a 4× larger model. Overall, our work demonstrates that eliciting deliberate cooperation in rubric-augmented verification makes reward models more trustworthy in a scalable way.
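
The pair-synthesis and rubric-gating steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how the shift is measured, so the sketch assumes each candidate rubric comes with the probability that the reward model picks the ground-truth preferred response when given that rubric, and compares it against a no-rubric baseline. The names ScoredRubric, synthesize_pairs, judge, p_correct, and margin are all hypothetical.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class ScoredRubric:
    text: str
    # Assumption: probability that the reward model prefers the ground-truth
    # chosen response when conditioned on this rubric.
    p_correct: float


def synthesize_pairs(
    rubrics: List[ScoredRubric],
    p_correct_no_rubric: float,
    margin: float = 0.0,
) -> List[Tuple[str, str]]:
    """Return (helpful, misleading) rubric pairs for one preference example.

    A rubric counts as helpful if it moves the reward model toward the
    correct preference by more than `margin`, and as misleading if it moves
    the model away by more than `margin`. Rubrics with no clear effect are
    left out of the pairs.
    """
    helpful = [r for r in rubrics if r.p_correct - p_correct_no_rubric > margin]
    misleading = [r for r in rubrics if p_correct_no_rubric - r.p_correct > margin]
    return [(h.text, m.text) for h, m in product(helpful, misleading)]


def judge(
    rubric: str,
    is_helpful: Callable[[str], bool],
    judge_with: Callable[[str], str],
    judge_without: Callable[[], str],
) -> str:
    """Follow the rubric only if the (hypothetical) verifier deems it helpful."""
    return judge_with(rubric) if is_helpful(rubric) else judge_without()


# Toy data, invented purely for illustration.
candidates = [
    ScoredRubric("Check the factual accuracy of each claim.", 0.85),
    ScoredRubric("Prefer the longer response.", 0.30),
    ScoredRubric("Reward clear structure.", 0.60),
]
pairs = synthesize_pairs(candidates, p_correct_no_rubric=0.60)
print(pairs)
```

In this toy example only the first rubric raises the probability of the correct preference and only the second lowers it, so a single contrastive pair results. In the framework the abstract describes, such pairs serve as the training signal for both the rubric generator and the critical verifier. The judge function mirrors the inference-time behavior of following a rubric only when it is judged helpful.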
