Beyond Length Scaling: Synergizing Breadth and Depth for Generative Reward Models

March 2, 2026
Authors: Qiyuan Zhang, Yufei Wang, Tianhe Wu, Can Xu, Qingfeng Sun, Kai Zheng, Xue Liu, Chen Ma
cs.AI

Abstract

Recent advancements in Generative Reward Models (GRMs) have demonstrated that scaling the length of Chain-of-Thought (CoT) reasoning considerably enhances the reliability of evaluation. However, current work predominantly relies on unstructured length scaling, ignoring the divergent efficacy of different reasoning mechanisms: Breadth-CoT (B-CoT, i.e., multi-dimensional principle coverage) and Depth-CoT (D-CoT, i.e., substantive judgment soundness). To address this, we introduce Mix-GRM, a framework that reconfigures raw rationales into structured B-CoT and D-CoT through a modular synthesis pipeline, subsequently employing Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR) to internalize and optimize these mechanisms. Comprehensive experiments demonstrate that Mix-GRM establishes a new state of the art across five benchmarks, surpassing leading open-source RMs by an average of 8.2%. Our results reveal a clear divergence in reasoning: B-CoT benefits subjective preference tasks, whereas D-CoT excels in objective correctness tasks; consequently, misaligning the reasoning mechanism with the task directly degrades performance. Furthermore, we demonstrate that RLVR acts as a switching amplifier, inducing an emergent polarization in which the model spontaneously allocates its reasoning style to match task demands. The synthesized data and models are released at https://huggingface.co/collections/DonJoey/mix-grm, and the code is available at https://github.com/Don-Joey/Mix-GRM.
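For context, the sketch below illustrates the kind of verifiable reward signal that RLVR training of a GRM typically optimizes: the model's free-form judgment is parsed for a final verdict and scored against a gold preference label. This is a minimal illustration under stated assumptions, not the authors' implementation; the [[A]]/[[B]] verdict convention and the helper names `parse_verdict` and `verifiable_reward` are hypothetical choices for the example.

```python
import re
from typing import Optional

def parse_verdict(cot_output: str) -> Optional[str]:
    """Extract the final verdict ("A" or "B") from a GRM's chain-of-thought
    output, assuming a [[A]]/[[B]] verdict convention (an illustrative
    assumption, not necessarily the paper's format). Returns None if no
    verdict is found."""
    match = re.search(r"\[\[([AB])\]\]", cot_output)
    return match.group(1) if match else None

def verifiable_reward(cot_output: str, gold_label: str) -> float:
    """Binary verifiable reward: 1.0 if the parsed verdict matches the
    gold preference label, 0.0 otherwise (malformed outputs with no
    parsable verdict also receive 0.0)."""
    return 1.0 if parse_verdict(cot_output) == gold_label else 0.0

# Example: a Depth-CoT style judgment on an objective correctness task.
sample_output = (
    "Step 1: Response A claims 7 * 8 = 54, which is incorrect.\n"
    "Step 2: Response B computes 7 * 8 = 56 and uses it consistently.\n"
    "Verdict: [[B]]"
)
print(verifiable_reward(sample_output, gold_label="B"))  # -> 1.0
```

Because the reward here is a verifiable label match rather than a learned critic, the same scalar signal can reinforce whichever reasoning style (breadth or depth) yields correct verdicts on a given task type, which is consistent with the switching-amplifier behavior the abstract describes.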