布尔巴基：面向定理证明的自生成与目标导向马尔可夫决策过程

摘要

推理对于大型语言模型（LLMs）而言仍是一项具有挑战性的任务，特别是在自动定理证明（ATP）这一逻辑约束严格的环境中，由于奖励稀疏且证明规模庞大，这一挑战尤为突出。在诸如PutnamBench这样的基准测试中，这些挑战被进一步放大，因为其中包含的大学级别问题需要复杂、多步骤的推理。为解决这一问题，我们引入了自生成目标条件马尔可夫决策过程（sG-MDPs），这是一种新框架，在此框架下，代理根据不断演进的证明状态生成并追求其子目标。鉴于目标生成更为结构化，由此产生的问题更易于搜索。随后，我们应用类似蒙特卡洛树搜索（MCTS）的算法来求解sG-MDP，并在Bourbaki（7B）中实例化我们的方法，Bourbaki（7B）是一个模块化系统，能够集成多个7B规模的LLM用于子目标生成与策略合成。在PutnamBench上，Bourbaki（7B）解决了26个问题，以这一规模的模型取得了新的最先进成果。

English

Reasoning remains a challenging task for large language models (LLMs), especially within the logically constrained environment of automated theorem proving (ATP), due to sparse rewards and the vast scale of proofs. These challenges are amplified in benchmarks like PutnamBench, which contains university-level problems requiring complex, multi-step reasoning. To address this, we introduce self-generated goal-conditioned MDPs (sG-MDPs), a new framework in which agents generate and pursue their subgoals based on the evolving proof state. Given this more structured generation of goals, the resulting problem becomes more amenable to search. We then apply Monte Carlo Tree Search (MCTS)-like algorithms to solve the sG-MDP, instantiating our approach in Bourbaki (7B), a modular system that can ensemble multiple 7B LLMs for subgoal generation and tactic synthesis. On PutnamBench, Bourbaki (7B) solves 26 problems, achieving new state-of-the-art results with models at this scale.