Beyond Length Scaling: Synergizing Breadth and Depth for Generative Reward Models
March 2, 2026
Authors: Qiyuan Zhang, Yufei Wang, Tianhe Wu, Can Xu, Qingfeng Sun, Kai Zheng, Xue Liu, Chen Ma
cs.AI
Abstract
Recent advances in Generative Reward Models (GRMs) have demonstrated that scaling the length of Chain-of-Thought (CoT) reasoning considerably enhances evaluation reliability. However, existing work relies predominantly on unstructured length scaling, ignoring the divergent efficacy of different reasoning mechanisms: Breadth-CoT (B-CoT, i.e., multi-dimensional principle coverage) and Depth-CoT (D-CoT, i.e., substantive judgment soundness). To address this, we introduce Mix-GRM, a framework that reconfigures raw rationales into structured B-CoT and D-CoT through a modular synthesis pipeline, then employs Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR) to internalize and optimize these mechanisms. Comprehensive experiments demonstrate that Mix-GRM establishes a new state of the art across five benchmarks, surpassing leading open-source RMs by an average of 8.2%. Our results reveal a clear divergence in reasoning: B-CoT benefits subjective preference tasks, whereas D-CoT excels at objective correctness tasks; consequently, misaligning the reasoning mechanism with the task type directly degrades performance. Furthermore, we show that RLVR acts as a switching amplifier, inducing an emergent polarization in which the model spontaneously allocates its reasoning style to match task demands. The synthesized data and models are released at https://huggingface.co/collections/DonJoey/mix-grm, and the code is available at https://github.com/Don-Joey/Mix-GRM.
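To make the RLVR step concrete, the sketch below shows one common way a "verifiable reward" can be computed for a generative reward model: the GRM emits a free-form judgment ending in a parseable verdict, which is checked against the ground-truth preference label. This is a minimal illustration, not the paper's released implementation; the `verifiable_reward` function and the `[[A]]`/`[[B]]` verdict format are assumptions for exposition.

```python
import re

def verifiable_reward(judgment: str, gold_label: str) -> float:
    """Binary verifiable reward for a generative reward model.

    `judgment` is the GRM's full chain-of-thought evaluation text;
    `gold_label` is the ground-truth preference ("A" or "B").
    The "[[A]]" / "[[B]]" verdict format is assumed here for
    illustration and may differ from the paper's exact protocol.
    """
    match = re.search(r"\[\[([AB])\]\]", judgment)
    if match is None:
        return 0.0  # unparsable verdicts earn no reward
    return 1.0 if match.group(1) == gold_label else 0.0

# Example: a D-CoT-style judgment that verifies the arithmetic
# in each response before committing to a verdict.
sample = (
    "Checking step 3: 17 * 24 = 408, so Response A's total is correct, "
    "while Response B miscomputes it as 398. Verdict: [[A]]"
)
print(verifiable_reward(sample, gold_label="A"))  # 1.0
```

Because the reward depends only on the final verdict matching a verifiable label, the policy is free to allocate its reasoning style (broad principle coverage vs. deep step checking) however best earns reward, which is consistent with the polarization effect the abstract attributes to RLVR.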