Beyond Length Scaling: Synergizing Breadth and Depth for Generative Reward Models
March 2, 2026
Authors: Qiyuan Zhang, Yufei Wang, Tianhe Wu, Can Xu, Qingfeng Sun, Kai Zheng, Xue Liu, Chen Ma
cs.AI
Abstract
Recent advances in Generative Reward Models (GRMs) have demonstrated that scaling the length of Chain-of-Thought (CoT) reasoning considerably enhances evaluation reliability. However, existing work relies predominantly on unstructured length scaling, ignoring the divergent efficacy of different reasoning mechanisms: Breadth-CoT (B-CoT, i.e., multi-dimensional principle coverage) and Depth-CoT (D-CoT, i.e., substantive judgment soundness). To address this, we introduce Mix-GRM, a framework that reconfigures raw rationales into structured B-CoT and D-CoT through a modular synthesis pipeline, then employs Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR) to internalize and optimize these mechanisms. Comprehensive experiments demonstrate that Mix-GRM establishes a new state of the art across five benchmarks, surpassing leading open-source RMs by an average of 8.2%. Our results reveal a clear divergence in reasoning: B-CoT benefits subjective preference tasks, whereas D-CoT excels at objective correctness tasks; consequently, misaligning the reasoning mechanism with the task type directly degrades performance. Furthermore, we show that RLVR acts as a switching amplifier, inducing an emergent polarization in which the model spontaneously allocates its reasoning style to match task demands. The synthesized data and models are released at https://huggingface.co/collections/DonJoey/mix-grm, and the code is available at https://github.com/Don-Joey/Mix-GRM.
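To make the RLVR step concrete, the sketch below shows one common way a "verifiable reward" can be computed for a generative reward model: the GRM emits a free-form judgment ending in a parseable verdict, which is checked against the ground-truth preference label. This is a minimal illustration, not the paper's released implementation; the `verifiable_reward` function and the `[[A]]`/`[[B]]` verdict format are assumptions for exposition.

```python
import re

def verifiable_reward(judgment: str, gold_label: str) -> float:
    """Binary verifiable reward for a generative reward model.

    `judgment` is the GRM's full chain-of-thought evaluation text;
    `gold_label` is the ground-truth preference ("A" or "B").
    The "[[A]]" / "[[B]]" verdict format is assumed here for
    illustration and may differ from the paper's exact protocol.
    """
    match = re.search(r"\[\[([AB])\]\]", judgment)
    if match is None:
        return 0.0  # unparsable verdicts earn no reward
    return 1.0 if match.group(1) == gold_label else 0.0

# Example: a D-CoT-style judgment that verifies the arithmetic
# in each response before committing to a verdict.
sample = (
    "Checking step 3: 17 * 24 = 408, so Response A's total is correct, "
    "while Response B miscomputes it as 398. Verdict: [[A]]"
)
print(verifiable_reward(sample, gold_label="A"))  # 1.0
```

Because the reward depends only on the final verdict matching a verifiable label, the policy is free to allocate its reasoning style (broad principle coverage vs. deep step checking) however best earns reward, which is consistent with the polarization effect the abstract attributes to RLVR.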