Bourbaki: 定理証明のための自己生成型および目標条件付きMDP

要旨

推論は、大規模言語モデル（LLM）にとって依然として困難な課題であり、特に自動定理証明（ATP）の論理的に制約された環境では、報酬が疎で証明の規模が膨大であるため、その難しさが増します。これらの課題は、大学レベルの問題を含み、複雑で多段階の推論を必要とするPutnamBenchのようなベンチマークでさらに顕著になります。これに対処するため、我々は自己生成目標条件付きMDP（sG-MDP）という新しいフレームワークを導入します。このフレームワークでは、エージェントが証明状態の変化に基づいてサブゴールを生成し、それを追求します。このように目標をより構造化して生成することで、結果として得られる問題は探索に適したものになります。次に、モンテカルロ木探索（MCTS）に似たアルゴリズムを適用してsG-MDPを解き、サブゴール生成と戦略合成のために複数の7B LLMをアンサンブルできるモジュールシステムであるBourbaki（7B）に我々のアプローチを実装します。PutnamBenchにおいて、Bourbaki（7B）は26の問題を解決し、この規模のモデルでは新たな最先端の結果を達成しました。

English

Reasoning remains a challenging task for large language models (LLMs), especially within the logically constrained environment of automated theorem proving (ATP), due to sparse rewards and the vast scale of proofs. These challenges are amplified in benchmarks like PutnamBench, which contains university-level problems requiring complex, multi-step reasoning. To address this, we introduce self-generated goal-conditioned MDPs (sG-MDPs), a new framework in which agents generate and pursue their subgoals based on the evolving proof state. Given this more structured generation of goals, the resulting problem becomes more amenable to search. We then apply Monte Carlo Tree Search (MCTS)-like algorithms to solve the sG-MDP, instantiating our approach in Bourbaki (7B), a modular system that can ensemble multiple 7B LLMs for subgoal generation and tactic synthesis. On PutnamBench, Bourbaki (7B) solves 26 problems, achieving new state-of-the-art results with models at this scale.

Bourbaki: 定理証明のための自己生成型および目標条件付きMDP

Bourbaki: Self-Generated and Goal-Conditioned MDPs for Theorem Proving

要旨

Support