DeepTheorem: 자연어와 강화 학습을 통한 정리 증명을 위한 대형 언어 모델의 추론 능력 향상

초록

정리 증명(theorem proving)은 대규모 언어 모델(LLMs)의 복잡한 추론 능력을 평가하는 주요 테스트베드 역할을 합니다. 그러나 기존의 자동화된 정리 증명(ATP) 접근법은 형식적인 증명 시스템에 크게 의존하며, 이는 LLM이 사전 학습 과정에서 습득한 비형식적 자연어 지식에서 나오는 강점과 잘 맞지 않습니다. 본 연구에서는 자연어를 활용하여 LLM의 수학적 추론을 강화하는 포괄적인 비형식적 정리 증명 프레임워크인 DeepTheorem을 제안합니다. DeepTheorem은 다양한 수학 영역에 걸친 121,000개의 고품질 IMO(국제수학올림피아드) 수준의 비형식적 정리와 증명으로 구성된 대규모 벤치마크 데이터셋을 포함하며, 이는 정확성, 난이도 및 주제 범주에 대해 엄격하게 주석 처리되었고, 체계적으로 구성된 검증 가능한 정리 변형을 동반합니다. 또한, 검증된 정리 변형을 활용하여 강력한 수학적 추론을 장려하도록 명시적으로 설계된 새로운 강화 학습 전략(RL-Zero)을 고안했습니다. 더불어, 증명의 정확성과 추론 단계의 질을 검토하는 포괄적인 결과 및 과정 평가 지표를 제안합니다. 광범위한 실험 분석을 통해 DeepTheorem이 기존 데이터셋과 지도 미세 조정 프로토콜에 비해 LLM의 정리 증명 성능을 크게 향상시키며, 최첨단의 정확도와 추론 품질을 달성함을 입증했습니다. 이러한 연구 결과는 DeepTheorem이 자동화된 비형식적 정리 증명과 수학적 탐구를 근본적으로 발전시킬 잠재력을 강조합니다.

English

Theorem proving serves as a major testbed for evaluating complex reasoning abilities in large language models (LLMs). However, traditional automated theorem proving (ATP) approaches rely heavily on formal proof systems that poorly align with LLMs' strength derived from informal, natural language knowledge acquired during pre-training. In this work, we propose DeepTheorem, a comprehensive informal theorem-proving framework exploiting natural language to enhance LLM mathematical reasoning. DeepTheorem includes a large-scale benchmark dataset consisting of 121K high-quality IMO-level informal theorems and proofs spanning diverse mathematical domains, rigorously annotated for correctness, difficulty, and topic categories, accompanied by systematically constructed verifiable theorem variants. We devise a novel reinforcement learning strategy (RL-Zero) explicitly tailored to informal theorem proving, leveraging the verified theorem variants to incentivize robust mathematical inference. Additionally, we propose comprehensive outcome and process evaluation metrics examining proof correctness and the quality of reasoning steps. Extensive experimental analyses demonstrate DeepTheorem significantly improves LLM theorem-proving performance compared to existing datasets and supervised fine-tuning protocols, achieving state-of-the-art accuracy and reasoning quality. Our findings highlight DeepTheorem's potential to fundamentally advance automated informal theorem proving and mathematical exploration.

DeepTheorem: 자연어와 강화 학습을 통한 정리 증명을 위한 대형 언어 모델의 추론 능력 향상

DeepTheorem: Advancing LLM Reasoning for Theorem Proving Through Natural Language and Reinforcement Learning

초록

Support