대규모 다중 턴 오프-폴리시 강화학습 및 다중 에이전트 트리 탐색을 활용한 LLM 단계적 증명기 확장

초록

대규모 언어 모델(LLMs)을 자동 정리 증명에 통합하는 것은 엄청난 가능성을 보여주었지만, 훈련 시 강화 학습(RL)과 추론 시 계산 자원의 확장성 문제로 인해 근본적인 제약을 받고 있습니다. 본 논문은 이러한 이중 확장 문제를 해결하기 위해 설계된 BFS-Prover-V2 시스템을 소개합니다. 우리는 두 가지 주요 혁신을 제시합니다. 첫 번째는 훈련 시 LLM 단계 증명기의 성능을 지속적으로 개선하기 위한 새로운 다중 턴 오프-폴리시 RL 프레임워크입니다. 이 프레임워크는 AlphaZero의 원리에 영감을 받아, 적응형 전술 수준 데이터 필터링과 주기적 재훈련을 특징으로 하는 다단계 전문가 반복 파이프라인을 통해 LLM 기반 에이전트의 장기 RL에서 일반적으로 발생하는 성능 정체를 극복합니다. 두 번째 혁신은 추론 시 추론 능력을 확장하는 플래너 강화 다중 에이전트 탐색 아키텍처입니다. 이 아키텍처는 일반 추론 모델을 고수준 플래너로 사용하여 복잡한 정리들을 일련의 단순한 하위 목표로 반복적으로 분해합니다. 이 계층적 접근법은 탐색 공간을 크게 줄여, 병렬 증명 에이전트 팀이 공유 증명 캐시를 활용하여 효율적으로 협업할 수 있게 합니다. 우리는 이러한 이중 확장 접근법이 기존의 형식적 수학 벤치마크에서 최첨단 결과를 달성함을 보여줍니다. BFS-Prover-V2는 MiniF2F와 ProofNet 테스트 세트에서 각각 95.08%와 41.4%의 성능을 달성했습니다. 본 연구에서 제시된 RL 및 추론 기술은 형식적 수학 영역에서 입증되었지만, 장기적 다중 턴 추론과 복잡한 탐색이 필요한 다른 영역에도 적용될 수 있는 광범위한 관심을 가질 수 있습니다.

English

The integration of Large Language Models (LLMs) into automated theorem proving has shown immense promise, yet is fundamentally constrained by challenges in scaling up both training-time reinforcement learning (RL) and inference-time compute. This paper introduces BFS-Prover-V2, a system designed to address this dual scaling problem. We present two primary innovations. The first is a novel multi-turn off-policy RL framework for continually improving the performance of LLM step-prover at training time. This framework, inspired by the principles of AlphaZero, utilizes a multi-stage expert iteration pipeline featuring adaptive tactic-level data filtering and periodic retraining to surmount the performance plateaus that typically curtail long-term RL in LLM-based agents. The second innovation is a planner-enhanced multi-agent search architecture that scales reasoning capabilities at inference time. This architecture employs a general reasoning model as a high-level planner to iteratively decompose complex theorems into a sequence of simpler subgoals. This hierarchical approach substantially reduces the search space, enabling a team of parallel prover agents to collaborate efficiently by leveraging a shared proof cache. We demonstrate that this dual approach to scaling yields state-of-the-art results on established formal mathematics benchmarks. BFS-Prover-V2 achieves 95.08\% and 41.4\% on the MiniF2F and ProofNet test sets respectively. While demonstrated in the domain of formal mathematics, the RL and inference techniques presented in this work are of broader interest and may be applied to other domains requiring long-horizon multi-turn reasoning and complex search.

대규모 다중 턴 오프-폴리시 강화학습 및 다중 에이전트 트리 탐색을 활용한 LLM 단계적 증명기 확장

Scaling up Multi-Turn Off-Policy RL and Multi-Agent Tree Search for LLM Step-Provers

초록

Support