PRISM: Pushing the Frontier of Deep Think via Process Reward Model-Guided Inference
March 3, 2026
Authors: Rituraj Sharma, Weiyuan Chen, Noah Provenzano, Tu Vu
cs.AI
Abstract
DEEPTHINK methods improve reasoning by generating, refining, and aggregating populations of candidate solutions, enabling strong performance on complex mathematical and scientific tasks. However, existing frameworks often lack reliable correctness signals during inference, creating a population-enhancement bottleneck: deeper deliberation amplifies errors, suppresses correct minority solutions, and yields diminishing returns from additional compute. In this paper, we introduce a functional decomposition of DEEPTHINK systems and propose PRISM, a Process Reward Model (PRM)-guided inference algorithm that uses step-level verification to guide both population refinement and solution aggregation. During refinement, PRISM treats candidate solutions as particles in a PRM-defined energy landscape and reshapes the population through score-guided resampling and stochastic refinement, concentrating probability mass on higher-quality reasoning while preserving diversity. Across mathematics and science benchmarks, PRISM is competitive with or outperforms existing DEEPTHINK methods, reaching 90.0%, 75.4%, and 71.4% accuracy with gpt-oss-20b on AIME25, HMMT25, and GPQA Diamond, respectively, while matching or exceeding gpt-oss-120b. Additionally, our analysis shows that PRISM produces consistent net-directional correction during refinement, remains reliable when the initial population contains few correct candidates, and often lies on the compute-accuracy Pareto frontier.
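The score-guided resampling step described above can be sketched as a particle-filter-style update: each candidate is scored by the PRM, the population is resampled with probability proportional to exp(score / T), and each survivor is stochastically refined. This is only an illustrative sketch under assumptions not stated in the abstract; the `prm_score`, `refine`, and `temperature` names and the softmax weighting are hypothetical choices, not the paper's actual implementation.

```python
import math
import random

def softmax(scores, temperature=1.0):
    # Turn PRM scores into normalized resampling weights.
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def prism_refine_step(population, prm_score, refine, temperature=0.5, rng=None):
    """One hypothetical PRM-guided refinement step:
    score each candidate, resample the population with replacement
    in proportion to exp(score / temperature), then apply stochastic
    refinement to each surviving particle. Lower temperature
    concentrates probability mass on higher-scoring candidates;
    resampling with replacement still permits diversity among ties.
    """
    rng = rng or random.Random()
    scores = [prm_score(c) for c in population]          # step-level verifier scores
    weights = softmax(scores, temperature)               # resampling distribution
    survivors = rng.choices(population, weights=weights, k=len(population))
    return [refine(c, rng) for c in survivors]           # stochastic refinement

# Toy usage with string candidates and a trivial scorer/refiner.
pop = ["bad step", "good step good", "ok"]
score = lambda c: c.count("good")                        # stand-in for a real PRM
identity_refine = lambda c, rng: c                       # stand-in for LLM refinement
new_pop = prism_refine_step(pop, score, identity_refine,
                            temperature=0.2, rng=random.Random(0))
```

With a low temperature, the resampled population is dominated by the highest-scoring candidate while the population size stays fixed, matching the abstract's description of reshaping the distribution rather than greedily selecting a single solution.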