Solve-Detect-Verify: Inference-Time Scaling with Flexible Generative Verifier

Abstract

Large Language Model (LLM) reasoning for complex tasks inherently involves a trade-off between solution accuracy and computational efficiency. The subsequent verification step, while intended to improve performance, complicates this landscape further by introducing its own difficult trade-off: sophisticated Generative Reward Models (GenRMs) can be computationally prohibitive if naively integrated with LLMs at test time, while simpler, faster methods may lack reliability. To overcome these challenges, we introduce FlexiVe, a novel generative verifier that flexibly balances computational resources between rapid, reliable fast thinking and meticulous slow thinking using a Flexible Allocation of Verification Budget strategy. We further propose the Solve-Detect-Verify pipeline, an efficient inference-time scaling framework that intelligently integrates FlexiVe, proactively identifying solution completion points to trigger targeted verification and provide focused solver feedback. Experiments show that FlexiVe achieves superior accuracy in pinpointing errors within reasoning traces on ProcessBench. Furthermore, on challenging mathematical reasoning benchmarks (AIME 2024, AIME 2025, and CNMO), our full approach outperforms baselines such as self-consistency in both reasoning accuracy and inference efficiency. Our system offers a scalable and effective solution to enhance LLM reasoning at test time.
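The control flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the vote-agreement rule used to decide when to escalate from fast to slow checking, the default parameter values, and the feedback strings are all assumptions made for illustration. The solver, completion detector, and checkers are passed in as plain callables standing in for model calls.

```python
from typing import Callable, Optional


def flexible_verify(
    fast_check: Callable[[str], bool],
    slow_check: Callable[[str], bool],
    solution: str,
    fast_samples: int = 4,               # hypothetical default
    agreement_threshold: float = 0.75,   # hypothetical default
) -> bool:
    """Spend a small budget on cheap checks; escalate only when they disagree.

    Assumption: agreement among repeated fast checks is used as the signal
    for whether the expensive slow check is needed.
    """
    votes = [fast_check(solution) for _ in range(fast_samples)]
    accept_ratio = sum(votes) / fast_samples
    majority_share = max(accept_ratio, 1.0 - accept_ratio)
    if majority_share >= agreement_threshold:
        # Fast checks agree strongly enough: trust their majority verdict.
        return accept_ratio >= 0.5
    # Fast checks are split: fall back to the slower, more careful check.
    return slow_check(solution)


def solve_detect_verify(
    solve: Callable[[str, Optional[str]], str],
    is_complete: Callable[[str], bool],
    verify: Callable[[str], bool],
    problem: str,
    max_rounds: int = 2,                 # hypothetical default
) -> str:
    """Solve, detect completion, verify, and feed the verdict back to the solver."""
    feedback: Optional[str] = None
    candidate = ""
    for _ in range(max_rounds):
        candidate = solve(problem, feedback)
        if not is_complete(candidate):
            # Detection step: verification is only triggered on complete solutions.
            feedback = "The solution looks unfinished; please complete it."
            continue
        if verify(candidate):
            return candidate
        feedback = "Verification flagged an error; please revise the solution."
    # Budget exhausted: return the last candidate, verified or not.
    return candidate
```

In this sketch the expensive check runs only when the cheap checks disagree, which is one simple way to realize the accuracy-versus-compute balance the abstract describes; the actual allocation strategy used by FlexiVe may differ.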
