Rethinking Reward Models for Multi-Domain Test-Time Scaling
October 1, 2025
Authors: Dong Bok Lee, Seanie Lee, Sangwoo Park, Minki Kang, Jinheon Baek, Dongki Kim, Dominik Wagner, Jiongdao Jin, Heejun Lee, Tobias Bocklet, Jinyu Wang, Jingjing Fu, Sung Ju Hwang, Jiang Bian, Lei Song
cs.AI
Abstract
The reliability of large language models (LLMs) during test-time scaling is
often assessed with external verifiers or reward models that
distinguish correct reasoning from flawed logic. Prior work generally assumes
that process reward models (PRMs), which score every intermediate reasoning
step, outperform outcome reward models (ORMs) that assess only the final
answer. This view is based mainly on evidence from narrow, math-adjacent
domains. We present the first unified evaluation of four reward model variants,
discriminative ORM and PRM (DisORM, DisPRM) and generative ORM and PRM
(GenORM, GenPRM), across 14 diverse domains. Contrary to conventional wisdom,
we find that (i) DisORM performs on par with DisPRM, (ii) GenPRM is not
competitive, and (iii) overall, GenORM is the most robust, yielding
significant and consistent gains across every tested domain. We attribute this
to PRM-style stepwise scoring, which inherits label noise from LLM
auto-labeling and has difficulty evaluating long reasoning trajectories,
including those involving self-correcting reasoning. Our theoretical analysis
shows that step-wise aggregation compounds errors as reasoning length grows,
and our empirical observations confirm this effect. These findings challenge
the prevailing assumption that fine-grained supervision is always better and
support generative outcome verification for multi-domain deployment. We
publicly release our code, datasets, and checkpoints at
https://github.com/db-Lee/Multi-RM
to facilitate future research in multi-domain settings.
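
The compounding-error claim in the abstract can be made concrete with a small toy simulation. The sketch below is an illustrative model under strong assumptions introduced here for illustration, not the paper's formal analysis: the per-step noise rate EPS, the independence of label flips, the AND-style aggregation of step verdicts, and the helper names (noisy_label, prm_accepts, orm_accepts) are all assumptions.

```python
import random

# Illustrative toy model (not the paper's exact analysis): assume each label
# produced by LLM auto-labeling is flipped independently with probability EPS.
# A PRM-style verifier accepts a correct trajectory only if every one of its T
# per-step labels survives the noise (AND aggregation); an ORM-style verifier
# issues a single noisy outcome-level judgment.

EPS = 0.05        # assumed per-label noise rate (hypothetical)
TRIALS = 50_000   # Monte Carlo trials per setting


def noisy_label(true_label: bool, eps: float) -> bool:
    """Return the true label, flipped with probability eps."""
    return true_label if random.random() >= eps else not true_label


def prm_accepts(num_steps: int, eps: float) -> bool:
    """Step-wise verdict on a fully correct trajectory: accepted only if
    every per-step label survives the noise."""
    return all(noisy_label(True, eps) for _ in range(num_steps))


def orm_accepts(eps: float) -> bool:
    """Outcome-level verdict: a single noisy label on the final answer."""
    return noisy_label(True, eps)


if __name__ == "__main__":
    for T in (4, 16, 64):
        prm_rate = sum(prm_accepts(T, EPS) for _ in range(TRIALS)) / TRIALS
        orm_rate = sum(orm_accepts(EPS) for _ in range(TRIALS)) / TRIALS
        print(f"T={T:3d}  PRM accept rate ≈ {prm_rate:.3f}  "
              f"ORM accept rate ≈ {orm_rate:.3f}  (1-EPS)^T = {(1 - EPS) ** T:.3f}")
```

Under these assumptions the step-wise acceptance rate for a correct trajectory is (1 - EPS)^T, which decays exponentially as the reasoning length T grows, while the outcome-level rate stays at 1 - EPS; this mirrors, in simplified form, the qualitative effect the abstract attributes to step-wise aggregation.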