DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search

Abstract

Although Reinforcement Learning with Verifiable Rewards (RLVR) has become an essential component for developing advanced reasoning skills in large language models (LLMs), recent studies have documented training plateaus that emerge after thousands of optimization steps, with performance gains shrinking notably despite increased computational investment. This limitation stems from the sparse exploration patterns inherent in current RLVR practices, where models rely on limited rollouts that often miss critical reasoning paths and fail to cover the solution space systematically. We present DeepSearch, a framework that integrates Monte Carlo Tree Search directly into RLVR training. In contrast to existing methods that rely on tree search only at inference, DeepSearch embeds structured search into the training loop, enabling systematic exploration and fine-grained credit assignment across reasoning steps. Through training-time exploration, DeepSearch addresses the fundamental bottleneck of insufficient exploration, which leads to diminishing performance improvements over prolonged training. Our contributions include: (1) a global frontier selection strategy that prioritizes promising nodes across the search tree, (2) selection with entropy-based guidance that identifies confident paths for supervision, and (3) adaptive replay buffer training with solution caching for efficiency. Experiments on mathematical reasoning benchmarks show that DeepSearch achieves 62.95% average accuracy and establishes a new state of the art for 1.5B reasoning models, while using 5.7x fewer GPU hours than extended training approaches. These results highlight the importance of strategic exploration over brute-force scaling and demonstrate the promise of algorithmic innovation for advancing RLVR methodologies. DeepSearch establishes a new direction for scaling reasoning capabilities through systematic search rather than prolonged computation.
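To make the idea of global frontier selection more concrete, here is a minimal sketch of choosing the next node to expand from all unexpanded leaves at once, rather than descending from the root each time. It is an illustration only and not the paper's implementation: the abstract does not specify the scoring rule, so `score` is an assumed priority, and `Node`, `GlobalFrontier`, `search`, and `toy_expand` are hypothetical names introduced here.

```python
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass
class Node:
    """One partial reasoning trace in the search tree (hypothetical structure)."""
    text: str
    score: float  # assumed priority (higher is better); not defined by the abstract
    depth: int = 0
    children: list = field(default_factory=list)


class GlobalFrontier:
    """Holds every unexpanded leaf of the tree in one max-priority queue."""

    def __init__(self):
        self._heap = []
        self._tie = itertools.count()  # tie-breaker so Node objects are never compared

    def push(self, node):
        # heapq is a min-heap, so the score is negated to pop the best node first.
        heapq.heappush(self._heap, (-node.score, next(self._tie), node))

    def pop_best(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


def search(root, expand, budget):
    """Expand the globally best leaf up to `budget` times; return the expansion order."""
    frontier = GlobalFrontier()
    frontier.push(root)
    order = []
    for _ in range(budget):
        if not frontier:
            break
        node = frontier.pop_best()
        order.append(node.text)
        for child in expand(node):
            node.children.append(child)
            frontier.push(child)
    return order


def toy_expand(node):
    """Toy stand-in for sampling continuations from a model: two children per node."""
    if node.depth >= 2:
        return []
    return [
        Node(node.text + "a", node.score * 0.9, node.depth + 1),
        Node(node.text + "b", node.score * 0.5, node.depth + 1),
    ]


order = search(Node("r", 1.0), toy_expand, budget=4)
print(order)  # ['r', 'ra', 'raa', 'rb']
```

In this toy run the fourth expansion jumps back to `rb` (score 0.5) instead of continuing under `ra` with `rab` (score 0.45), which is the behaviour a single queue over the whole tree gives: the search compares candidates across branches instead of committing to one subtree.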
