Scaling up Multi-Turn Off-Policy RL and Multi-Agent Tree Search for LLM Step-Provers
September 8, 2025
Authors: Ran Xin, Zeyu Zheng, Yanchen Nie, Kun Yuan, Xia Xiao
cs.AI
Abstract
The integration of Large Language Models (LLMs) into automated theorem proving has shown immense promise, yet it is fundamentally constrained by challenges in scaling up both training-time reinforcement learning (RL) and inference-time compute. This paper introduces BFS-Prover-V2, a system designed to address this dual scaling problem. We present two primary innovations. The first is a novel multi-turn off-policy RL framework for continually improving the performance of an LLM step-prover at training time. This framework, inspired by the principles of AlphaZero, utilizes a multi-stage expert iteration pipeline featuring adaptive tactic-level data filtering and periodic retraining to surmount the performance plateaus that typically curtail long-term RL in LLM-based agents. The second innovation is a planner-enhanced multi-agent search architecture that scales reasoning capabilities at inference time. This architecture employs a general reasoning model as a high-level planner to iteratively decompose complex theorems into sequences of simpler subgoals. This hierarchical approach substantially reduces the search space, enabling a team of parallel prover agents to collaborate efficiently by leveraging a shared proof cache. We demonstrate that this dual approach to scaling yields state-of-the-art results on established formal mathematics benchmarks: BFS-Prover-V2 achieves 95.08% and 41.4% on the MiniF2F and ProofNet test sets, respectively. While demonstrated in the domain of formal mathematics, the RL and inference techniques presented in this work are of broader interest and may be applied to other domains requiring long-horizon multi-turn reasoning and complex search.
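
To make the training-time idea concrete, the following is a minimal Python sketch of a multi-stage expert iteration loop with adaptive tactic-level data filtering and retraining from an accumulated off-policy pool. The function names (`run_proof_search`, `adaptive_filter`, `finetune`), the `pass_rate` statistic, and the threshold schedule are illustrative assumptions, not the paper's actual interfaces.

```python
"""Sketch of a multi-stage expert-iteration loop with adaptive tactic-level
data filtering, in the spirit of the abstract. All names and thresholds are
illustrative assumptions rather than the system's real API."""

import random
from dataclasses import dataclass


@dataclass
class TacticSample:
    state: str        # pretty-printed proof state (e.g. a Lean goal)
    tactic: str       # tactic proposed by the step-prover
    pass_rate: float  # fraction of search runs in which this tactic advanced the proof


def run_proof_search(policy, theorems):
    """Placeholder for tree search driven by the current step-prover policy.
    Returns tactic-level samples gathered from successful proofs."""
    return [TacticSample(state=t, tactic="simp", pass_rate=random.random())
            for t in theorems]


def adaptive_filter(samples, max_pass_rate):
    """Keep only 'hard' tactics that the current policy does not already
    solve reliably; the threshold tightens as training progresses."""
    return [s for s in samples if s.pass_rate <= max_pass_rate]


def finetune(policy, samples):
    """Placeholder for supervised fine-tuning on (state, tactic) pairs."""
    return policy  # a real implementation would return updated weights


def expert_iteration(policy, theorems, stages=3):
    replay = []  # off-policy pool: data from earlier stages is reused
    for stage in range(stages):
        samples = run_proof_search(policy, theorems)
        # Adaptive tactic-level filtering: demand harder examples later on.
        replay.extend(adaptive_filter(samples, max_pass_rate=0.9 - 0.2 * stage))
        # Retrain on the accumulated pool rather than only the freshest data.
        policy = finetune(policy, replay)
    return policy
```

The key design point this sketch tries to capture is that filtering happens at the level of individual tactics, and the retraining data is drawn from the whole accumulated pool rather than only the latest on-policy rollouts.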
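
The inference-time architecture can likewise be pictured as a high-level planner that decomposes a theorem into subgoals, which parallel prover agents attack while sharing a proof cache. The sketch below is an assumption-laden mock-up: `plan_subgoals` and `prove_subgoal` stand in for the reasoning-model planner and the step-prover tree search, and are not the paper's actual components.

```python
"""Sketch of a planner-enhanced multi-agent proof search with a shared proof
cache, following the abstract's description. Planner and prover are stubs."""

from concurrent.futures import ThreadPoolExecutor


def plan_subgoals(theorem):
    """Placeholder for the general reasoning model acting as a planner:
    decompose the theorem into a sequence of simpler subgoal statements."""
    return [f"{theorem} :: subgoal {i}" for i in range(3)]


def prove_subgoal(subgoal, cache):
    """Placeholder for one step-prover agent running tree search on a subgoal.
    The shared cache lets agents reuse proofs of previously seen subgoals."""
    if subgoal in cache:
        return cache[subgoal]
    proof = f"<proof of {subgoal}>"  # stand-in for a found tactic sequence
    cache[subgoal] = proof
    return proof


def prove(theorem, max_rounds=2, workers=4):
    cache = {}  # proof cache shared by all prover agents
    for _ in range(max_rounds):
        subgoals = plan_subgoals(theorem)  # planner narrows the search space
        with ThreadPoolExecutor(max_workers=workers) as pool:
            proofs = list(pool.map(lambda g: prove_subgoal(g, cache), subgoals))
        if all(proofs):
            # Stitch subgoal proofs into a proof of the original theorem.
            return "\n".join(proofs)
    return None


if __name__ == "__main__":
    print(prove("theorem demo : 1 + 1 = 2"))
```

The shared cache is what makes the parallelism pay off: agents working on different subgoals, or on successive planner rounds, avoid re-proving statements that any agent has already closed.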