
Scaling up Multi-Turn Off-Policy RL and Multi-Agent Tree Search for LLM Step-Provers

September 8, 2025
Authors: Ran Xin, Zeyu Zheng, Yanchen Nie, Kun Yuan, Xia Xiao
cs.AI

Abstract

The integration of Large Language Models (LLMs) into automated theorem proving has shown immense promise, yet is fundamentally constrained by challenges in scaling up both training-time reinforcement learning (RL) and inference-time compute. This paper introduces BFS-Prover-V2, a system designed to address this dual scaling problem. We present two primary innovations. The first is a novel multi-turn off-policy RL framework for continually improving the performance of LLM step-provers at training time. This framework, inspired by the principles of AlphaZero, utilizes a multi-stage expert iteration pipeline featuring adaptive tactic-level data filtering and periodic retraining to surmount the performance plateaus that typically curtail long-term RL in LLM-based agents. The second innovation is a planner-enhanced multi-agent search architecture that scales reasoning capabilities at inference time. This architecture employs a general reasoning model as a high-level planner to iteratively decompose complex theorems into a sequence of simpler subgoals. This hierarchical approach substantially reduces the search space, enabling a team of parallel prover agents to collaborate efficiently by leveraging a shared proof cache. We demonstrate that this dual approach to scaling yields state-of-the-art results on established formal mathematics benchmarks: BFS-Prover-V2 achieves 95.08% and 41.4% on the MiniF2F and ProofNet test sets, respectively. While demonstrated in the domain of formal mathematics, the RL and inference techniques presented in this work are of broader interest and may be applied to other domains requiring long-horizon multi-turn reasoning and complex search.
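
To make the training-time recipe concrete, below is a minimal illustrative Python sketch of one expert-iteration round with tactic-level data filtering. The names (`TacticSample`, `filter_tactics`, `expert_iteration_round`, the `keep_if_below` threshold) and the specific filtering criterion are hypothetical stand-ins chosen for exposition, not the paper's actual implementation.

```python
# Illustrative sketch only: one expert-iteration round with adaptive
# tactic-level filtering. The filtering rule and all names are assumptions.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TacticSample:
    goal: str           # pretty-printed proof state the tactic was applied to
    tactic: str         # tactic string that advanced the proof
    policy_prob: float  # probability the current policy assigns to this tactic


def filter_tactics(samples: List[TacticSample],
                   keep_if_below: float = 0.9) -> List[TacticSample]:
    """Hypothetical tactic-level filter: drop steps the current policy already
    finds easy, so retraining focuses on informative proof steps."""
    return [s for s in samples if s.policy_prob < keep_if_below]


def expert_iteration_round(
    theorems: List[str],
    search: Callable[[str], List[TacticSample]],    # tree search with the current policy
    retrain: Callable[[List[TacticSample]], None],  # supervised fine-tuning step
) -> None:
    """One round of expert iteration: search, filter, retrain."""
    collected: List[TacticSample] = []
    for thm in theorems:
        collected.extend(search(thm))       # harvest successful proof steps
    retrain(filter_tactics(collected))      # periodic retraining on filtered data
```

The idea mirrored here is that proof data harvested by search is pruned at the level of individual tactic steps before each periodic retraining pass, so the model keeps seeing informative examples as its competence grows.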
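Similarly, the inference-time architecture can be pictured as a high-level planner feeding subgoals to a pool of prover agents that share a proof cache. The sketch below is an assumption-laden illustration: `plan_subgoals` and the body of `prove` are stubs standing in for an LLM planner and a Lean tactic search, respectively.

```python
# Illustrative sketch only: planner-decomposed subgoals proved by parallel
# agents over a shared proof cache. plan_subgoals() and prove() are stubs.
from concurrent.futures import ThreadPoolExecutor
from threading import Lock
from typing import Dict, List, Optional

proof_cache: Dict[str, str] = {}   # subgoal statement -> proof script
cache_lock = Lock()


def plan_subgoals(theorem: str) -> List[str]:
    """High-level planner: decompose a theorem into simpler subgoals.
    Stub: in practice a general reasoning LLM produces this decomposition."""
    return [theorem]  # trivial decomposition as a placeholder


def prove(subgoal: str) -> Optional[str]:
    """Prover agent: attempt one subgoal, reusing cached results when possible."""
    with cache_lock:
        if subgoal in proof_cache:
            return proof_cache[subgoal]
    proof = None  # placeholder for a best-first tactic search with the step-prover
    if proof is not None:
        with cache_lock:
            proof_cache[subgoal] = proof
    return proof


def prove_theorem(theorem: str, num_agents: int = 4) -> bool:
    """The theorem is closed only if every subgoal in the decomposition is proved."""
    subgoals = plan_subgoals(theorem)
    with ThreadPoolExecutor(max_workers=num_agents) as pool:
        results = list(pool.map(prove, subgoals))
    return all(r is not None for r in results)
```

A shared cache lets agents reuse subgoal proofs discovered by their peers, which is where the parallelism pays off when decompositions overlap.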