Scaling up Multi-Turn Off-Policy RL and Multi-Agent Tree Search for LLM Step-Provers
September 8, 2025
Authors: Ran Xin, Zeyu Zheng, Yanchen Nie, Kun Yuan, Xia Xiao
cs.AI
Abstract
The integration of Large Language Models (LLMs) into automated theorem proving has shown immense promise, yet it is fundamentally constrained by challenges in scaling up both training-time reinforcement learning (RL) and inference-time compute. This paper introduces BFS-Prover-V2, a system designed to address this dual scaling problem. We present two primary innovations. The first is a novel multi-turn off-policy RL framework for continually improving the performance of an LLM step-prover at training time. This framework, inspired by the principles of AlphaZero, utilizes a multi-stage expert iteration pipeline featuring adaptive tactic-level data filtering and periodic retraining to surmount the performance plateaus that typically curtail long-term RL in LLM-based agents. The second innovation is a planner-enhanced multi-agent search architecture that scales reasoning capabilities at inference time. This architecture employs a general reasoning model as a high-level planner to iteratively decompose complex theorems into sequences of simpler subgoals. This hierarchical approach substantially reduces the search space, enabling a team of parallel prover agents to collaborate efficiently by leveraging a shared proof cache. We demonstrate that this dual approach to scaling yields state-of-the-art results on established formal mathematics benchmarks: BFS-Prover-V2 achieves 95.08% and 41.4% on the MiniF2F and ProofNet test sets, respectively. While demonstrated in the domain of formal mathematics, the RL and inference techniques presented in this work are of broader interest and may be applied to other domains requiring long-horizon multi-turn reasoning and complex search.
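
To make the training-time idea concrete, the following is a minimal Python sketch of a multi-stage expert iteration loop with adaptive tactic-level data filtering and retraining from an accumulated off-policy pool. The function names (`run_proof_search`, `adaptive_filter`, `finetune`), the `pass_rate` statistic, and the threshold schedule are illustrative assumptions, not the paper's actual interfaces.

```python
"""Sketch of a multi-stage expert-iteration loop with adaptive tactic-level
data filtering, in the spirit of the abstract. All names and thresholds are
illustrative assumptions rather than the system's real API."""

import random
from dataclasses import dataclass


@dataclass
class TacticSample:
    state: str        # pretty-printed proof state (e.g. a Lean goal)
    tactic: str       # tactic proposed by the step-prover
    pass_rate: float  # fraction of search runs in which this tactic advanced the proof


def run_proof_search(policy, theorems):
    """Placeholder for tree search driven by the current step-prover policy.
    Returns tactic-level samples gathered from successful proofs."""
    return [TacticSample(state=t, tactic="simp", pass_rate=random.random())
            for t in theorems]


def adaptive_filter(samples, max_pass_rate):
    """Keep only 'hard' tactics that the current policy does not already
    solve reliably; the threshold tightens as training progresses."""
    return [s for s in samples if s.pass_rate <= max_pass_rate]


def finetune(policy, samples):
    """Placeholder for supervised fine-tuning on (state, tactic) pairs."""
    return policy  # a real implementation would return updated weights


def expert_iteration(policy, theorems, stages=3):
    replay = []  # off-policy pool: data from earlier stages is reused
    for stage in range(stages):
        samples = run_proof_search(policy, theorems)
        # Adaptive tactic-level filtering: demand harder examples later on.
        replay.extend(adaptive_filter(samples, max_pass_rate=0.9 - 0.2 * stage))
        # Retrain on the accumulated pool rather than only the freshest data.
        policy = finetune(policy, replay)
    return policy
```

The key design point this sketch tries to capture is that filtering happens at the level of individual tactics, and the retraining data is drawn from the whole accumulated pool rather than only the latest on-policy rollouts.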
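
The inference-time architecture can likewise be pictured as a high-level planner that decomposes a theorem into subgoals, which parallel prover agents attack while sharing a proof cache. The sketch below is an assumption-laden mock-up: `plan_subgoals` and `prove_subgoal` stand in for the reasoning-model planner and the step-prover tree search, and are not the paper's actual components.

```python
"""Sketch of a planner-enhanced multi-agent proof search with a shared proof
cache, following the abstract's description. Planner and prover are stubs."""

from concurrent.futures import ThreadPoolExecutor


def plan_subgoals(theorem):
    """Placeholder for the general reasoning model acting as a planner:
    decompose the theorem into a sequence of simpler subgoal statements."""
    return [f"{theorem} :: subgoal {i}" for i in range(3)]


def prove_subgoal(subgoal, cache):
    """Placeholder for one step-prover agent running tree search on a subgoal.
    The shared cache lets agents reuse proofs of previously seen subgoals."""
    if subgoal in cache:
        return cache[subgoal]
    proof = f"<proof of {subgoal}>"  # stand-in for a found tactic sequence
    cache[subgoal] = proof
    return proof


def prove(theorem, max_rounds=2, workers=4):
    cache = {}  # proof cache shared by all prover agents
    for _ in range(max_rounds):
        subgoals = plan_subgoals(theorem)  # planner narrows the search space
        with ThreadPoolExecutor(max_workers=workers) as pool:
            proofs = list(pool.map(lambda g: prove_subgoal(g, cache), subgoals))
        if all(proofs):
            # Stitch subgoal proofs into a proof of the original theorem.
            return "\n".join(proofs)
    return None


if __name__ == "__main__":
    print(prove("theorem demo : 1 + 1 = 2"))
```

The shared cache is what makes the parallelism pay off: agents working on different subgoals, or on successive planner rounds, avoid re-proving statements that any agent has already closed.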