Bourbaki: Self-Generated and Goal-Conditioned MDPs for Theorem Proving
July 3, 2025
Authors: Matthieu Zimmer, Xiaotong Ji, Rasul Tutunov, Anthony Bordg, Jun Wang, Haitham Bou Ammar
cs.AI
Abstract
Reasoning remains a challenging task for large language models (LLMs),
especially within the logically constrained environment of automated theorem
proving (ATP), due to sparse rewards and the vast scale of proofs. These
challenges are amplified in benchmarks like PutnamBench, which contains
university-level problems requiring complex, multi-step reasoning. To address
this, we introduce self-generated goal-conditioned MDPs (sG-MDPs), a new
framework in which agents generate and pursue their subgoals based on the
evolving proof state. Given this more structured generation of goals, the
resulting problem becomes more amenable to search. We then apply Monte Carlo
Tree Search (MCTS)-like algorithms to solve the sG-MDP, instantiating our
approach in Bourbaki (7B), a modular system that can ensemble multiple 7B LLMs
for subgoal generation and tactic synthesis. On PutnamBench, Bourbaki (7B)
solves 26 problems, achieving new state-of-the-art results with models at this
scale.
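To make the search component of the abstract concrete, below is a minimal, self-contained sketch of UCT-style Monte Carlo Tree Search over a toy stand-in for an sG-MDP. Everything here is an illustrative assumption, not the paper's actual system: the "proof state" is just an integer that must be reduced to exactly 0, the candidate "subgoals" are fixed decrements, and the reward is sparse (1 only when the goal is fully closed), loosely mirroring the sparse-reward structure of ATP. Bourbaki's real environment (Lean proof states, LLM-driven subgoal generation and tactic synthesis) is far richer than this sketch.

```python
import math
import random

# Toy stand-in for an sG-MDP: the proof state is an integer n that must be
# driven to exactly 0. "Subgoal generation" proposes an intermediate target,
# and a "tactic" closes it by subtracting it. Overshooting below 0 is an
# illegal move that ends the episode with no reward (sparse reward, as in ATP).
ACTIONS = [1, 2, 3]  # candidate subgoal sizes a generator might propose


def step(state, action):
    """Apply one subgoal; return (next_state, reward). None = dead end."""
    nxt = state - action
    if nxt < 0:
        return None, 0.0
    return nxt, (1.0 if nxt == 0 else 0.0)


class Node:
    def __init__(self, state):
        self.state = state
        self.children = {}   # action -> Node
        self.visits = 0
        self.value = 0.0     # cumulative reward backed up through this node


def uct_select(node, c=1.4):
    """Pick the child maximizing the UCB1 score."""
    return max(
        node.children.items(),
        key=lambda kv: kv[1].value / (kv[1].visits + 1e-9)
        + c * math.sqrt(math.log(node.visits + 1) / (kv[1].visits + 1e-9)),
    )


def rollout(state, rng):
    """Random playout from a state; returns 1.0 iff the goal gets closed."""
    while state and state > 0:
        state, r = step(state, rng.choice(ACTIONS))
        if r > 0:
            return 1.0
    return 0.0


def mcts(root_state, iters=400, seed=0):
    rng = random.Random(seed)
    root = Node(root_state)
    for _ in range(iters):
        node, path = root, [root]
        # Selection: descend while the node is fully expanded.
        while node.children and len(node.children) == len(ACTIONS):
            _, node = uct_select(node)
            path.append(node)
        # Expansion: try one untried subgoal, if the state is still open.
        if node.state and node.state > 0:
            untried = [a for a in ACTIONS if a not in node.children]
            if untried:
                a = rng.choice(untried)
                nxt, _ = step(node.state, a)
                child = Node(nxt)
                node.children[a] = child
                path.append(child)
                node = child
        # Simulation + backpropagation of the sparse terminal reward.
        if node.state and node.state > 0:
            reward = rollout(node.state, rng)
        else:
            reward = 1.0 if node.state == 0 else 0.0
        for n in path:
            n.visits += 1
            n.value += reward
    # Recommend the most-visited subgoal at the root.
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]
```

For example, from state 1 only the subgoal 1 closes the goal (2 and 3 overshoot), so `mcts(1)` recommends 1; from state 2 the search avoids the always-failing subgoal 3. In the actual system, the random `ACTIONS` and `rollout` would be replaced by ensembled 7B LLMs proposing subgoals and tactics against the live proof state.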