大規模言語モデル（LLM）のための多ターンオフポリシー強化学習とマルチエージェント木探索のスケールアップステップ証明器

要旨

大規模言語モデル（LLMs）を自動定理証明に統合することは非常に有望であるが、訓練時の強化学習（RL）と推論時の計算資源の両方のスケーリングにおける課題によって根本的に制約されている。本論文では、この二重のスケーリング問題に対処するために設計されたシステム、BFS-Prover-V2を紹介する。我々は二つの主要な革新を提示する。第一に、訓練時にLLMステップ証明器の性能を継続的に向上させるための、新しいマルチターンオフポリシーRLフレームワークを提案する。このフレームワークは、AlphaZeroの原則にインスパイアされ、適応的な戦術レベルのデータフィルタリングと定期的な再訓練を特徴とする多段階のエキスパートイテレーションパイプラインを活用し、LLMベースのエージェントにおける長期的なRLの性能頭打ちを克服する。第二の革新は、推論時に推論能力をスケールするプランナー強化型マルチエージェント検索アーキテクチャである。このアーキテクチャは、高レベルのプランナーとして一般的な推論モデルを使用し、複雑な定理を一連のより単純なサブゴールに反復的に分解する。この階層的アプローチにより、検索空間が大幅に削減され、共有証明キャッシュを活用して並列証明エージェントのチームが効率的に協力できるようになる。我々は、この二重のスケーリングアプローチが、確立された形式数学ベンチマークで最先端の結果をもたらすことを実証する。BFS-Prover-V2は、MiniF2FとProofNetのテストセットでそれぞれ95.08％と41.4％を達成した。形式数学の領域で実証されているが、本論文で提示されたRLと推論技術は、長期的なマルチターン推論と複雑な検索を必要とする他の領域にも適用可能であり、広範な関心を集めるものである。

English

The integration of Large Language Models (LLMs) into automated theorem proving has shown immense promise, yet is fundamentally constrained by challenges in scaling up both training-time reinforcement learning (RL) and inference-time compute. This paper introduces BFS-Prover-V2, a system designed to address this dual scaling problem. We present two primary innovations. The first is a novel multi-turn off-policy RL framework for continually improving the performance of LLM step-prover at training time. This framework, inspired by the principles of AlphaZero, utilizes a multi-stage expert iteration pipeline featuring adaptive tactic-level data filtering and periodic retraining to surmount the performance plateaus that typically curtail long-term RL in LLM-based agents. The second innovation is a planner-enhanced multi-agent search architecture that scales reasoning capabilities at inference time. This architecture employs a general reasoning model as a high-level planner to iteratively decompose complex theorems into a sequence of simpler subgoals. This hierarchical approach substantially reduces the search space, enabling a team of parallel prover agents to collaborate efficiently by leveraging a shared proof cache. We demonstrate that this dual approach to scaling yields state-of-the-art results on established formal mathematics benchmarks. BFS-Prover-V2 achieves 95.08\% and 41.4\% on the MiniF2F and ProofNet test sets respectively. While demonstrated in the domain of formal mathematics, the RL and inference techniques presented in this work are of broader interest and may be applied to other domains requiring long-horizon multi-turn reasoning and complex search.

大規模言語モデル（LLM）のための多ターンオフポリシー強化学習とマルチエージェント木探索のスケールアップステップ証明器

Scaling up Multi-Turn Off-Policy RL and Multi-Agent Tree Search for LLM Step-Provers

要旨

Support