DeepTheorem: 自然言語と強化学習による定理証明のためのLLM推論の進化

要旨

定理証明は、大規模言語モデル（LLMs）の複雑な推論能力を評価する主要なテストベッドとして機能している。しかし、従来の自動定理証明（ATP）アプローチは、LLMsが事前学習中に獲得した非公式な自然言語知識に基づく強みとあまり整合しない形式的証明システムに大きく依存している。本研究では、自然言語を活用してLLMの数学的推論を強化する包括的な非公式定理証明フレームワークであるDeepTheoremを提案する。DeepTheoremは、多様な数学領域にわたる121Kの高品質なIMOレベルの非公式定理と証明からなる大規模ベンチマークデータセットを含み、正確性、難易度、トピックカテゴリについて厳密に注釈が付けられ、体系的に構築された検証可能な定理バリアントが付随している。我々は、非公式定理証明に特化した新しい強化学習戦略（RL-Zero）を考案し、検証済みの定理バリアントを活用して堅牢な数学的推論を促進する。さらに、証明の正確性と推論ステップの質を検証する包括的な結果およびプロセス評価指標を提案する。広範な実験分析により、DeepTheoremが既存のデータセットや教師ありファインチューニングプロトコルと比較してLLMの定理証明性能を大幅に向上させ、最先端の精度と推論品質を達成することが実証された。我々の知見は、DeepTheoremが自動化された非公式定理証明と数学的探求を根本的に進化させる可能性を強調している。

English

Theorem proving serves as a major testbed for evaluating complex reasoning abilities in large language models (LLMs). However, traditional automated theorem proving (ATP) approaches rely heavily on formal proof systems that poorly align with LLMs' strength derived from informal, natural language knowledge acquired during pre-training. In this work, we propose DeepTheorem, a comprehensive informal theorem-proving framework exploiting natural language to enhance LLM mathematical reasoning. DeepTheorem includes a large-scale benchmark dataset consisting of 121K high-quality IMO-level informal theorems and proofs spanning diverse mathematical domains, rigorously annotated for correctness, difficulty, and topic categories, accompanied by systematically constructed verifiable theorem variants. We devise a novel reinforcement learning strategy (RL-Zero) explicitly tailored to informal theorem proving, leveraging the verified theorem variants to incentivize robust mathematical inference. Additionally, we propose comprehensive outcome and process evaluation metrics examining proof correctness and the quality of reasoning steps. Extensive experimental analyses demonstrate DeepTheorem significantly improves LLM theorem-proving performance compared to existing datasets and supervised fine-tuning protocols, achieving state-of-the-art accuracy and reasoning quality. Our findings highlight DeepTheorem's potential to fundamentally advance automated informal theorem proving and mathematical exploration.