Bourbaki: 정리 증명을 위한 자기 생성 및 목표 조건부 MDPs

초록

추론은 대규모 언어 모델(LLMs)에게 여전히 도전적인 과제로 남아 있으며, 특히 자동 정리 증명(ATP)의 논리적으로 제약된 환경에서는 희소한 보상과 방대한 규모의 증명으로 인해 더욱 어려워집니다. 이러한 도전은 대학 수준의 복잡한 다단계 추론을 요구하는 PutnamBench와 같은 벤치마크에서 더욱 두드러집니다. 이를 해결하기 위해, 우리는 자체 생성 목표 조건 MDPs(sG-MDPs)라는 새로운 프레임워크를 소개합니다. 이 프레임워크에서는 에이전트가 증명 상태의 변화에 따라 하위 목표를 생성하고 추구합니다. 이러한 구조화된 목표 생성 덕분에 결과적인 문제는 탐색에 더 적합해집니다. 그런 다음, 우리는 Monte Carlo Tree Search(MCTS)와 유사한 알고리즘을 적용하여 sG-MDP를 해결하고, 하위 목표 생성 및 전술 합성을 위해 여러 7B LLMs를 앙상블할 수 있는 모듈식 시스템인 Bourbaki(7B)에서 우리의 접근 방식을 구현합니다. PutnamBench에서 Bourbaki(7B)는 26개의 문제를 해결하며, 이 규모의 모델로는 새로운 최첨단 결과를 달성합니다.

English

Reasoning remains a challenging task for large language models (LLMs), especially within the logically constrained environment of automated theorem proving (ATP), due to sparse rewards and the vast scale of proofs. These challenges are amplified in benchmarks like PutnamBench, which contains university-level problems requiring complex, multi-step reasoning. To address this, we introduce self-generated goal-conditioned MDPs (sG-MDPs), a new framework in which agents generate and pursue their subgoals based on the evolving proof state. Given this more structured generation of goals, the resulting problem becomes more amenable to search. We then apply Monte Carlo Tree Search (MCTS)-like algorithms to solve the sG-MDP, instantiating our approach in Bourbaki (7B), a modular system that can ensemble multiple 7B LLMs for subgoal generation and tactic synthesis. On PutnamBench, Bourbaki (7B) solves 26 problems, achieving new state-of-the-art results with models at this scale.

Bourbaki: 정리 증명을 위한 자기 생성 및 목표 조건부 MDPs

Bourbaki: Self-Generated and Goal-Conditioned MDPs for Theorem Proving

초록

Support