

PRISM: Pushing the Frontier of Deep Think via Process Reward Model-Guided Inference

March 3, 2026
作者: Rituraj Sharma, Weiyuan Chen, Noah Provenzano, Tu Vu
cs.AI

Abstract

DEEPTHINK methods improve reasoning by generating, refining, and aggregating populations of candidate solutions, which enables strong performance on complex mathematical and scientific tasks. However, existing frameworks often lack reliable correctness signals during inference, which creates a population-enhancement bottleneck where deeper deliberation amplifies errors, suppresses correct minority solutions, and yields diminishing returns from additional compute. In this paper, we introduce a functional decomposition of DEEPTHINK systems and propose PRISM, a Process Reward Model (PRM)-guided inference algorithm that uses step-level verification to guide both population refinement and solution aggregation. During refinement, PRISM treats candidate solutions as particles in a PRM-defined energy landscape and reshapes the population through score-guided resampling and stochastic refinement, which concentrates probability mass on higher-quality reasoning while preserving diversity. Across mathematics and science benchmarks, PRISM is competitive with or outperforms existing DEEPTHINK methods, reaching 90.0%, 75.4%, and 71.4% with gpt-oss-20b on AIME25, HMMT25, and GPQA Diamond, respectively, while matching or exceeding gpt-oss-120b. Additionally, our analysis shows that PRISM produces consistent net-directional correction during refinement, remains reliable when the initial population contains few correct candidates, and often lies on the compute-accuracy Pareto frontier.
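The abstract's score-guided resampling step can be illustrated with a minimal sketch. This is not the paper's implementation: the function name, the softmax weighting, and the multinomial resampling scheme are all assumptions, standing in for whatever scoring and resampling PRISM actually uses. It only shows the general idea of treating candidates as particles and concentrating the population on higher-PRM-score reasoning paths while a temperature parameter preserves some diversity.

```python
import math
import random

def prm_guided_resample(candidates, prm_scores, temperature=1.0, rng=None):
    """Hypothetical sketch of score-guided resampling: draw a new
    population of the same size, with each candidate's selection
    probability given by a softmax over its PRM score."""
    rng = rng or random.Random(0)
    # Softmax over PRM scores; subtracting the max keeps exp() stable.
    # Lower temperature sharpens the distribution (less diversity).
    m = max(prm_scores)
    weights = [math.exp((s - m) / temperature) for s in prm_scores]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Multinomial resampling: high-scoring candidates are duplicated,
    # low-scoring ones tend to drop out of the population.
    return rng.choices(candidates, weights=probs, k=len(candidates))
```

In a full system, each surviving particle would then be stochastically refined by the model before the next scoring round; here only the reweighting step is shown.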
PDF (May 8, 2026)