Beyond Length Scaling: Synergizing Breadth and Depth for Generative Reward Models

Abstract

Recent advances in Generative Reward Models (GRMs) have shown that scaling the length of Chain-of-Thought (CoT) reasoning considerably improves the reliability of evaluation. However, existing work relies predominantly on unstructured length scaling and overlooks the divergent efficacy of two reasoning mechanisms: Breadth-CoT (B-CoT, i.e., multi-dimensional principle coverage) and Depth-CoT (D-CoT, i.e., substantive judgment soundness). To address this, we introduce Mix-GRM, a framework that reconfigures raw rationales into structured B-CoT and D-CoT through a modular synthesis pipeline, and then applies Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR) to internalize and optimize these mechanisms. Comprehensive experiments show that Mix-GRM establishes a new state of the art across five benchmarks, surpassing leading open-source reward models by an average of 8.2%. Our results reveal a clear divergence in reasoning: B-CoT benefits subjective preference tasks, whereas D-CoT excels in objective correctness tasks. Consequently, misaligning the reasoning mechanism with the task directly degrades performance. Furthermore, we demonstrate that RLVR acts as a switching amplifier, inducing an emergent polarization in which the model spontaneously allocates its reasoning style to match task demands. The synthesized data and models are released on Hugging Face at https://huggingface.co/collections/DonJoey/mix-grm, and the code is released on GitHub at https://github.com/Don-Joey/Mix-GRM.
