DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search
September 29, 2025
Authors: Fang Wu, Weihao Xuan, Heli Qi, Ximing Lu, Aaron Tu, Li Erran Li, Yejin Choi
cs.AI
Abstract
Although reinforcement learning with verifiable rewards (RLVR) has become an essential component for developing advanced
reasoning skills in LLMs, contemporary studies have documented training
plateaus that emerge following thousands of optimization steps, demonstrating
notable decreases in performance gains despite increased computational
investment. This limitation stems from the sparse exploration patterns inherent
in current RLVR practices, where models rely on limited rollouts that often
miss critical reasoning paths and fail to provide systematic coverage of the
solution space. We present DeepSearch, a framework that integrates Monte Carlo
Tree Search directly into RLVR training. In contrast to existing methods that
rely on tree search only at inference, DeepSearch embeds structured search into
the training loop, enabling systematic exploration and fine-grained credit
assignment across reasoning steps. Through training-time exploration,
DeepSearch addresses the fundamental bottleneck of insufficient exploration,
which leads to diminishing performance improvements over prolonged training
steps. Our contributions include: (1) a global frontier selection strategy that
prioritizes promising nodes across the search tree, (2) selection with
entropy-based guidance that identifies confident paths for supervision, and (3)
adaptive replay buffer training with solution caching for efficiency.
Experiments on mathematical reasoning benchmarks show that DeepSearch achieves
62.95% average accuracy and establishes a new state-of-the-art for 1.5B
reasoning models, while using 5.7x fewer GPU hours than extended training
approaches. These results highlight the importance of strategic exploration
over brute-force scaling and demonstrate the promise of algorithmic innovation
for advancing RLVR methodologies. DeepSearch establishes a new direction for
scaling reasoning capabilities through systematic search rather than prolonged
computation.
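The first two contributions can be illustrated with a minimal sketch. Assuming each tree node tracks a mean verifier reward, a visit count, and the policy entropy at its reasoning step (all field names and the scoring formula are hypothetical, not taken from the paper), global frontier selection with entropy-based guidance might look like:

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """A node in the search tree over partial reasoning traces."""
    value: float = 0.0    # mean verifier reward of rollouts through this node
    visits: int = 0
    entropy: float = 0.0  # policy entropy at this step (low = confident)
    children: list = field(default_factory=list)
    terminal: bool = False


def frontier_score(node: Node, total_visits: int,
                   c: float = 1.4, beta: float = 0.5) -> float:
    """UCT-style value/exploration trade-off with an entropy penalty,
    so confident (low-entropy) high-value nodes rank highest."""
    if node.visits == 0:
        return float("inf")  # always expand unvisited frontier nodes first
    exploit = node.value
    explore = c * math.sqrt(math.log(total_visits) / node.visits)
    return exploit + explore - beta * node.entropy


def select_frontier(nodes: list, total_visits: int) -> Node:
    """Global frontier selection: rank every expandable leaf across the
    whole tree, rather than only the children of the current node."""
    frontier = [n for n in nodes if not n.terminal and not n.children]
    return max(frontier, key=lambda n: frontier_score(n, total_visits))
```

The selected node would then be expanded by sampling continuations from the policy, and the resulting high-confidence paths cached in a replay buffer for supervision, per contribution (3).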