Value-Guided Search for Efficient Chain-of-Thought Reasoning

Abstract

In this paper, we propose a simple and efficient method for training value models on long-context reasoning traces. Unlike existing process reward models (PRMs), our method does not require a fine-grained notion of "step," which is difficult to define for long-context reasoning models. Using a dataset of 2.5 million reasoning traces, we train a 1.5B-parameter token-level value model and apply it to DeepSeek models to improve performance under test-time compute scaling. We find that block-wise value-guided search (VGS) with a final weighted majority vote scales better at test time than standard methods such as majority voting or best-of-n. With an inference budget of 64 generations, VGS with DeepSeek-R1-Distill-1.5B achieves an average accuracy of 45.7% across four competition math benchmarks (AIME 2024 & 2025, HMMT Feb 2024 & 2025), reaching parity with o3-mini-medium. Moreover, VGS significantly reduces the inference FLOPs required to match the performance of majority voting. Our dataset, model, and codebase are open-sourced.
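To make the comparison between aggregation rules concrete, the sketch below contrasts majority voting, best-of-n, and a value-weighted majority vote, and adds one plausible reading of a block-wise, value-guided beam search. It is an illustration under stated assumptions, not the authors' implementation: the callables `extend`, `value`, and `is_done`, and the parameters `beam_width`, `branch`, and `max_blocks`, are hypothetical names introduced here.

```python
from collections import defaultdict
from typing import Callable, Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Return the most frequent answer (ties go to the earliest-seen answer)."""
    counts: dict[str, float] = defaultdict(float)
    for a in answers:
        counts[a] += 1.0
    return max(counts, key=counts.get)


def best_of_n(answers: Sequence[str], scores: Sequence[float]) -> str:
    """Return the answer of the single highest-scoring response."""
    best = max(range(len(answers)), key=lambda i: scores[i])
    return answers[best]


def weighted_majority_vote(answers: Sequence[str], scores: Sequence[float]) -> str:
    """Sum the scores per distinct answer and return the answer with the largest total."""
    totals: dict[str, float] = defaultdict(float)
    for a, s in zip(answers, scores):
        totals[a] += s
    return max(totals, key=totals.get)


def blockwise_value_guided_search(
    prompt: str,
    extend: Callable[[str], str],    # hypothetical: appends one block of tokens
    value: Callable[[str], float],   # hypothetical: value-model score of a partial trace
    is_done: Callable[[str], bool],  # hypothetical: True once a trace is complete
    beam_width: int,
    branch: int,
    max_blocks: int,
) -> list[str]:
    """Assumed beam-search variant: grow traces block by block and keep
    the `beam_width` partial traces with the highest value score."""
    beams = [prompt]
    finished: list[str] = []
    for _ in range(max_blocks):
        # Each surviving beam is extended `branch` times by one block.
        candidates = [extend(b) for b in beams for _ in range(branch)]
        finished += [c for c in candidates if is_done(c)]
        live = [c for c in candidates if not is_done(c)]
        if not live:
            break
        live.sort(key=value, reverse=True)
        beams = live[:beam_width]
    return finished
```

For example, with answers `["42", "17", "42", "17", "17"]` and scores `[0.9, 0.2, 0.8, 0.1, 0.3]`, majority voting picks "17" (three votes to two), whereas best-of-n and the weighted vote both pick "42", because the two high-scoring responses outweigh the three low-scoring ones.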
