DeepTheorem:通過自然語言與強化學習推進大語言模型的定理證明推理能力
DeepTheorem: Advancing LLM Reasoning for Theorem Proving Through Natural Language and Reinforcement Learning
May 29, 2025
作者: Ziyin Zhang, Jiahao Xu, Zhiwei He, Tian Liang, Qiuzhi Liu, Yansi Li, Linfeng Song, Zhengwen Liang, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, Dong Yu
cs.AI
摘要
定理證明作為評估大型語言模型(LLMs)複雜推理能力的主要測試平台。然而,傳統的自動化定理證明(ATP)方法過度依賴於形式化證明系統,這些系統與LLMs在預訓練期間獲得的非正式、自然語言知識所展現的優勢並不契合。在本研究中,我們提出了DeepTheorem,這是一個全面的非正式定理證明框架,利用自然語言來增強LLM的數學推理能力。DeepTheorem包含一個大規模的基準數據集,該數據集由121K個高質量的IMO級別的非正式定理和證明組成,涵蓋了多樣的數學領域,並嚴格註解了正確性、難度和主題類別,同時配備了系統構建的可驗證定理變體。我們設計了一種新穎的強化學習策略(RL-Zero),專門針對非正式定理證明,利用驗證過的定理變體來激勵穩健的數學推理。此外,我們提出了全面的結果和過程評估指標,以檢驗證明的正確性和推理步驟的質量。廣泛的實驗分析表明,與現有的數據集和監督微調協議相比,DeepTheorem顯著提升了LLM的定理證明性能,達到了最先進的準確性和推理質量。我們的研究結果凸顯了DeepTheorem在根本性推進自動化非正式定理證明和數學探索方面的潛力。
English
Theorem proving serves as a major testbed for evaluating complex reasoning
abilities in large language models (LLMs). However, traditional automated
theorem proving (ATP) approaches rely heavily on formal proof systems that
poorly align with LLMs' strength derived from informal, natural language
knowledge acquired during pre-training. In this work, we propose DeepTheorem, a
comprehensive informal theorem-proving framework exploiting natural language to
enhance LLM mathematical reasoning. DeepTheorem includes a large-scale
benchmark dataset consisting of 121K high-quality IMO-level informal theorems
and proofs spanning diverse mathematical domains, rigorously annotated for
correctness, difficulty, and topic categories, accompanied by systematically
constructed verifiable theorem variants. We devise a novel reinforcement
learning strategy (RL-Zero) explicitly tailored to informal theorem proving,
leveraging the verified theorem variants to incentivize robust mathematical
inference. Additionally, we propose comprehensive outcome and process
evaluation metrics examining proof correctness and the quality of reasoning
steps. Extensive experimental analyses demonstrate DeepTheorem significantly
improves LLM theorem-proving performance compared to existing datasets and
supervised fine-tuning protocols, achieving state-of-the-art accuracy and
reasoning quality. Our findings highlight DeepTheorem's potential to
fundamentally advance automated informal theorem proving and mathematical
exploration.Summary
AI-Generated Summary