PRISM: Pushing the Frontier of Deep Think via Process Reward Model-Guided Inference

Abstract

DEEPTHINK methods improve reasoning by generating, refining, and aggregating populations of candidate solutions, which enables strong performance on complex mathematical and scientific tasks. However, existing frameworks often lack reliable correctness signals during inference. This creates a population-enhancement bottleneck in which deeper deliberation amplifies errors, suppresses correct minority solutions, and yields weak returns on additional compute. In this paper, we introduce a functional decomposition of DEEPTHINK systems and propose PRISM, a Process Reward Model (PRM)-guided inference algorithm that uses step-level verification to guide both population refinement and solution aggregation. During refinement, PRISM treats candidate solutions as particles in a PRM-defined energy landscape and reshapes the population through score-guided resampling and stochastic refinement, which concentrates probability mass on higher-quality reasoning while preserving diversity. Across mathematics and science benchmarks, PRISM is competitive with or outperforms existing DEEPTHINK methods, reaching 90.0%, 75.4%, and 71.4% with gpt-oss-20b on AIME25, HMMT25, and GPQA Diamond, respectively, while matching or exceeding the performance of gpt-oss-120b. Additionally, our analysis shows that PRISM produces consistent net-directional correction during refinement, remains reliable when the initial population contains few correct candidates, and often lies on the compute-accuracy Pareto frontier.
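
The refinement and aggregation stages described in the abstract can be illustrated with a small sketch. The abstract does not state the resampling weights, the acceptance rule, or the aggregation rule, so everything below is an assumption made for illustration: Boltzmann weights over PRM scores (energy taken as the negative score), a Metropolis-style acceptance test for revisions, and a score-weighted vote over final answers. All function names, parameters, and the toy scorer are hypothetical and are not taken from the paper.

```python
import math
import random
from collections import defaultdict
from typing import Callable, Hashable, List, Optional, TypeVar

T = TypeVar("T")


def boltzmann_weights(scores: List[float], temperature: float) -> List[float]:
    """Normalized weights proportional to exp(score / temperature).

    Assumption of this sketch: energy = -score, so higher scores get more mass.
    """
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def refine_population(
    population: List[T],
    prm_score: Callable[[T], float],
    propose: Callable[[T, random.Random], T],
    rounds: int = 3,
    temperature: float = 0.5,
    rng: Optional[random.Random] = None,
) -> List[T]:
    """Alternate score-guided resampling with stochastic refinement."""
    rng = rng or random.Random(0)
    current = list(population)
    for _ in range(rounds):
        # Score-guided resampling: draw with replacement, favoring high scores.
        weights = boltzmann_weights([prm_score(c) for c in current], temperature)
        current = rng.choices(current, weights=weights, k=len(current))
        # Stochastic refinement: always keep improvements, and keep a worse
        # revision with probability exp(gain / temperature) (assumed rule).
        refined = []
        for cand in current:
            revised = propose(cand, rng)
            gain = prm_score(revised) - prm_score(cand)
            accept = gain >= 0 or rng.random() < math.exp(gain / temperature)
            refined.append(revised if accept else cand)
        current = refined
    return current


def aggregate(
    population: List[T],
    prm_score: Callable[[T], float],
    answer_of: Callable[[T], Hashable],
    temperature: float = 0.5,
) -> Hashable:
    """Score-weighted vote: return the answer with the largest total weight."""
    weights = boltzmann_weights([prm_score(c) for c in population], temperature)
    totals = defaultdict(float)
    for cand, weight in zip(population, weights):
        totals[answer_of(cand)] += weight
    return max(totals, key=totals.get)


# Hypothetical stand-ins: a "candidate" is just an integer guess, the toy
# scorer rewards closeness to a hidden target, and a revision nudges the guess.
TARGET = 42


def toy_prm_score(guess: int) -> float:
    return -float(abs(guess - TARGET))


def toy_propose(guess: int, rng: random.Random) -> int:
    return guess + rng.choice([-1, 0, 1])
```

In a real system the candidates would be full reasoning traces, `prm_score` would call a trained PRM, and `propose` would ask the language model for a revision. Calling `refine_population([30, 40, 50, 45], toy_prm_score, toy_propose)` and then `aggregate(...)` with `answer_of=lambda g: g` exercises the whole loop on the toy problem. A weighted vote is used here rather than a plain majority vote because it lets a single high-scoring candidate outweigh several low-scoring ones, which is one plausible way to avoid suppressing a correct minority solution.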
