DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search
September 29, 2025
Authors: Fang Wu, Weihao Xuan, Heli Qi, Ximing Lu, Aaron Tu, Li Erran Li, Yejin Choi
cs.AI
Abstract
Although reinforcement learning with verifiable rewards (RLVR) has become an essential component for developing advanced
reasoning skills in LLMs, contemporary studies have documented training
plateaus that emerge following thousands of optimization steps, demonstrating
notable decreases in performance gains despite increased computational
investment. This limitation stems from the sparse exploration patterns inherent
in current RLVR practices, where models rely on limited rollouts that often
miss critical reasoning paths and fail to provide systematic coverage of the
solution space. We present DeepSearch, a framework that integrates Monte Carlo
Tree Search directly into RLVR training. In contrast to existing methods that
rely on tree search only at inference, DeepSearch embeds structured search into
the training loop, enabling systematic exploration and fine-grained credit
assignment across reasoning steps. Through training-time exploration,
DeepSearch addresses the fundamental bottleneck of insufficient exploration,
which leads to diminishing performance improvements over prolonged training
steps. Our contributions include: (1) a global frontier selection strategy that
prioritizes promising nodes across the search tree, (2) selection with
entropy-based guidance that identifies confident paths for supervision, and (3)
adaptive replay buffer training with solution caching for efficiency.
Experiments on mathematical reasoning benchmarks show that DeepSearch achieves
62.95% average accuracy and establishes a new state-of-the-art for 1.5B
reasoning models, while using 5.7x fewer GPU hours than extended training
approaches. These results highlight the importance of strategic exploration
over brute-force scaling and demonstrate the promise of algorithmic innovation
for advancing RLVR methodologies. DeepSearch establishes a new direction for
scaling reasoning capabilities through systematic search rather than prolonged
computation.
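The first two contributions can be illustrated with a minimal sketch. Assuming each tree node tracks a mean verifier reward, a visit count, and the policy entropy at its reasoning step (all field names and the scoring formula are hypothetical, not taken from the paper), global frontier selection with entropy-based guidance might look like:

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """A node in the search tree over partial reasoning traces."""
    value: float = 0.0    # mean verifier reward of rollouts through this node
    visits: int = 0
    entropy: float = 0.0  # policy entropy at this step (low = confident)
    children: list = field(default_factory=list)
    terminal: bool = False


def frontier_score(node: Node, total_visits: int,
                   c: float = 1.4, beta: float = 0.5) -> float:
    """UCT-style value/exploration trade-off with an entropy penalty,
    so confident (low-entropy) high-value nodes rank highest."""
    if node.visits == 0:
        return float("inf")  # always expand unvisited frontier nodes first
    exploit = node.value
    explore = c * math.sqrt(math.log(total_visits) / node.visits)
    return exploit + explore - beta * node.entropy


def select_frontier(nodes: list, total_visits: int) -> Node:
    """Global frontier selection: rank every expandable leaf across the
    whole tree, rather than only the children of the current node."""
    frontier = [n for n in nodes if not n.terminal and not n.children]
    return max(frontier, key=lambda n: frontier_score(n, total_visits))
```

The selected node would then be expanded by sampling continuations from the policy, and the resulting high-confidence paths cached in a replay buffer for supervision, per contribution (3).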