Rethinking Reward Models for Multi-Domain Test-Time Scaling
October 1, 2025
Authors: Dong Bok Lee, Seanie Lee, Sangwoo Park, Minki Kang, Jinheon Baek, Dongki Kim, Dominik Wagner, Jiongdao Jin, Heejun Lee, Tobias Bocklet, Jinyu Wang, Jingjing Fu, Sung Ju Hwang, Jiang Bian, Lei Song
cs.AI
Abstract
The reliability of large language models (LLMs) during test-time scaling is often assessed with external verifiers or reward models that distinguish correct reasoning from flawed logic. Prior work generally assumes that process reward models (PRMs), which score every intermediate reasoning step, outperform outcome reward models (ORMs) that assess only the final answer. This view is based mainly on evidence from narrow, math-adjacent domains. We present the first unified evaluation of four reward model variants, discriminative ORM and PRM (DisORM, DisPRM) and generative ORM and PRM (GenORM, GenPRM), across 14 diverse domains. Contrary to conventional wisdom, we find that (i) DisORM performs on par with DisPRM, (ii) GenPRM is not competitive, and (iii) overall, GenORM is the most robust, yielding significant and consistent gains across every tested domain. We attribute this to PRM-style stepwise scoring, which inherits label noise from LLM auto-labeling and has difficulty evaluating long reasoning trajectories, including those involving self-correcting reasoning. Our theoretical analysis shows that stepwise aggregation compounds errors as reasoning length grows, and our empirical observations confirm this effect. These findings challenge the prevailing assumption that fine-grained supervision is always better and support generative outcome verification for multi-domain deployment. We publicly release our code, datasets, and checkpoints at https://github.com/db-Lee/Multi-RM to facilitate future research in multi-domain settings.
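To make the compounding-error claim concrete, consider a minimal back-of-the-envelope model (our own simplification, not the paper's actual analysis): assume each step verdict is independently correct with probability $1-\epsilon$, and a trajectory's overall judgment is correct only if all $T$ step verdicts are. Then

$$\Pr[\text{correct overall judgment}] = (1-\epsilon)^{T}, \qquad \text{e.g. } \epsilon = 0.05,\; T = 30 \;\Rightarrow\; 0.95^{30} \approx 0.21,$$

whereas a single outcome-level verdict is wrong with probability only $\epsilon$ regardless of $T$. This is one way stepwise aggregation can degrade as reasoning trajectories grow longer.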
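For readers less familiar with how verifiers enter test-time scaling, the sketch below shows best-of-N reranking under the two scoring styles. It is an illustrative assumption rather than the released Multi-RM code: orm_score and prm_scores are hypothetical callables standing in for trained reward models, and product aggregation is only one of several plausible ways to combine step scores.

    # Minimal best-of-N reranking sketch (illustrative only, not the released
    # Multi-RM code). `orm_score` and `prm_scores` are hypothetical callables
    # standing in for trained reward models.
    import math
    from typing import Callable, List, Sequence


    def best_of_n_orm(candidates: Sequence[str],
                      orm_score: Callable[[str], float]) -> str:
        # ORM-style selection: one score per complete candidate solution.
        return max(candidates, key=orm_score)


    def best_of_n_prm(candidates: Sequence[List[str]],
                      prm_scores: Callable[[List[str]], List[float]]) -> List[str]:
        # PRM-style selection: one score per reasoning step, aggregated here by
        # product, so each extra step multiplies in another possibly noisy score.
        return max(candidates, key=lambda steps: math.prod(prm_scores(steps)))

Swapping the product for min or a mean of the step scores changes the arithmetic but not the qualitative point: PRM-style selection depends on every step verdict, while ORM-style selection depends on a single one.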