Rethinking Reward Models for Multi-Domain Test-Time Scaling
October 1, 2025
Authors: Dong Bok Lee, Seanie Lee, Sangwoo Park, Minki Kang, Jinheon Baek, Dongki Kim, Dominik Wagner, Jiongdao Jin, Heejun Lee, Tobias Bocklet, Jinyu Wang, Jingjing Fu, Sung Ju Hwang, Jiang Bian, Lei Song
cs.AI
Abstract
The reliability of large language models (LLMs) during test-time scaling is
often assessed with external verifiers or reward models that
distinguish correct reasoning from flawed logic. Prior work generally assumes
that process reward models (PRMs), which score every intermediate reasoning
step, outperform outcome reward models (ORMs) that assess only the final
answer. This view is based mainly on evidence from narrow, math-adjacent
domains. We present the first unified evaluation of four reward model variants,
discriminative ORM and PRM (DisORM, DisPRM) and generative ORM and PRM
(GenORM, GenPRM), across 14 diverse domains. Contrary to conventional wisdom,
we find that (i) DisORM performs on par with DisPRM, (ii) GenPRM is not
competitive, and (iii) overall, GenORM is the most robust, yielding
significant and consistent gains across every tested domain. We attribute this
to PRM-style stepwise scoring, which inherits label noise from LLM
auto-labeling and has difficulty evaluating long reasoning trajectories,
including those involving self-correcting reasoning. Our theoretical analysis
shows that step-wise aggregation compounds errors as reasoning length grows,
and our empirical observations confirm this effect. These findings challenge
the prevailing assumption that fine-grained supervision is always better and
support generative outcome verification for multi-domain deployment. We
publicly release our code, datasets, and checkpoints at
https://github.com/db-Lee/Multi-RM
to facilitate future research in multi-domain settings.
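
The compounding-error claim in the abstract can be made concrete with a small toy simulation. The sketch below is an illustrative model under strong assumptions introduced here for illustration, not the paper's formal analysis: the per-step noise rate EPS, the independence of label flips, the AND-style aggregation of step verdicts, and the helper names (noisy_label, prm_accepts, orm_accepts) are all assumptions.

```python
import random

# Illustrative toy model (not the paper's exact analysis): assume each label
# produced by LLM auto-labeling is flipped independently with probability EPS.
# A PRM-style verifier accepts a correct trajectory only if every one of its T
# per-step labels survives the noise (AND aggregation); an ORM-style verifier
# issues a single noisy outcome-level judgment.

EPS = 0.05        # assumed per-label noise rate (hypothetical)
TRIALS = 50_000   # Monte Carlo trials per setting


def noisy_label(true_label: bool, eps: float) -> bool:
    """Return the true label, flipped with probability eps."""
    return true_label if random.random() >= eps else not true_label


def prm_accepts(num_steps: int, eps: float) -> bool:
    """Step-wise verdict on a fully correct trajectory: accepted only if
    every per-step label survives the noise."""
    return all(noisy_label(True, eps) for _ in range(num_steps))


def orm_accepts(eps: float) -> bool:
    """Outcome-level verdict: a single noisy label on the final answer."""
    return noisy_label(True, eps)


if __name__ == "__main__":
    for T in (4, 16, 64):
        prm_rate = sum(prm_accepts(T, EPS) for _ in range(TRIALS)) / TRIALS
        orm_rate = sum(orm_accepts(EPS) for _ in range(TRIALS)) / TRIALS
        print(f"T={T:3d}  PRM accept rate ≈ {prm_rate:.3f}  "
              f"ORM accept rate ≈ {orm_rate:.3f}  (1-EPS)^T = {(1 - EPS) ** T:.3f}")
```

Under these assumptions the step-wise acceptance rate for a correct trajectory is (1 - EPS)^T, which decays exponentially as the reasoning length T grows, while the outcome-level rate stays at 1 - EPS; this mirrors, in simplified form, the qualitative effect the abstract attributes to step-wise aggregation.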