ChatPaper.aiChatPaper

DeepTheorem:通过自然语言与强化学习提升大语言模型的定理证明推理能力

DeepTheorem: Advancing LLM Reasoning for Theorem Proving Through Natural Language and Reinforcement Learning

May 29, 2025
作者: Ziyin Zhang, Jiahao Xu, Zhiwei He, Tian Liang, Qiuzhi Liu, Yansi Li, Linfeng Song, Zhengwen Liang, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, Dong Yu
cs.AI

摘要

定理证明作为评估大型语言模型(LLMs)复杂推理能力的重要测试平台。然而,传统的自动定理证明(ATP)方法严重依赖形式化证明系统,这些系统与LLMs在预训练期间获得的非正式自然语言知识所展现的优势并不契合。在本研究中,我们提出了DeepTheorem,一个全面的非正式定理证明框架,利用自然语言增强LLM的数学推理能力。DeepTheorem包含一个大规模基准数据集,涵盖121K个高质量的IMO级别非正式定理及证明,横跨多个数学领域,并严格标注了正确性、难度及主题分类,同时配有系统构建的可验证定理变体。我们设计了一种新颖的强化学习策略(RL-Zero),专门针对非正式定理证明,利用已验证的定理变体激励稳健的数学推理。此外,我们提出了全面的结果与过程评估指标,检验证明的正确性及推理步骤的质量。广泛的实验分析表明,与现有数据集及监督微调协议相比,DeepTheorem显著提升了LLM的定理证明性能,达到了最先进的准确率与推理质量。我们的发现凸显了DeepTheorem在推动自动化非正式定理证明及数学探索方面的根本性潜力。
English
Theorem proving serves as a major testbed for evaluating complex reasoning abilities in large language models (LLMs). However, traditional automated theorem proving (ATP) approaches rely heavily on formal proof systems that poorly align with LLMs' strength derived from informal, natural language knowledge acquired during pre-training. In this work, we propose DeepTheorem, a comprehensive informal theorem-proving framework exploiting natural language to enhance LLM mathematical reasoning. DeepTheorem includes a large-scale benchmark dataset consisting of 121K high-quality IMO-level informal theorems and proofs spanning diverse mathematical domains, rigorously annotated for correctness, difficulty, and topic categories, accompanied by systematically constructed verifiable theorem variants. We devise a novel reinforcement learning strategy (RL-Zero) explicitly tailored to informal theorem proving, leveraging the verified theorem variants to incentivize robust mathematical inference. Additionally, we propose comprehensive outcome and process evaluation metrics examining proof correctness and the quality of reasoning steps. Extensive experimental analyses demonstrate DeepTheorem significantly improves LLM theorem-proving performance compared to existing datasets and supervised fine-tuning protocols, achieving state-of-the-art accuracy and reasoning quality. Our findings highlight DeepTheorem's potential to fundamentally advance automated informal theorem proving and mathematical exploration.

Summary

AI-Generated Summary

PDF142May 30, 2025