TPTP 생태계 내 LLM 수학적 추론을 위한 포화 기반 데이터셋 생성

초록

고품질이며 논리적으로 타당한 데이터의 부족은 대규모 언어 모델(LLMs)의 수학적 추론 능력을 발전시키는 데 있어 중요한 병목 현상입니다. 우리의 연구는 이 문제에 직면하여 수십 년간의 자동 정리 증명 연구를 확장 가능한 데이터 엔진으로 전환합니다. 오류가 발생하기 쉬운 LLMs나 Lean 및 Isabelle과 같은 복잡한 증명 보조 도구의 구문에 의존하는 대신, 우리의 프레임워크는 TPTP 공리 라이브러리에서 E-prover의 포화 능력을 활용하여 방대하고 검증된 정리 코퍼스를 도출합니다. 우리의 파이프라인은 원칙적이고 단순합니다: 공리를 포화시키고, "흥미로운" 정리를 필터링하며, 과제를 생성합니다. LLMs를 사용하지 않음으로써, 우리는 구조적으로 사실 오류를 제거합니다. 이 순수 기호 데이터는 세 가지 난이도 조절 과제로 변환됩니다: 함의 검증, 전제 선택, 그리고 증명 재구성입니다. 최첨단 모델에 대한 제로샷 실험은 깊고 구조적인 추론이 필요한 과제에서 성능이 급격히 저하되는 명확한 약점을 보여줍니다. 우리의 프레임워크는 이 격차를 측정할 수 있는 진단 도구와 이를 해결하기 위한 확장 가능한 기호 학습 데이터의 원천을 제공합니다. 우리는 코드와 데이터를 공개적으로 제공합니다. https://github.com/sileod/reasoning_core https://hf.co/datasets/reasoning-core/rc1

English

The scarcity of high-quality, logically sound data is a critical bottleneck for advancing the mathematical reasoning of Large Language Models (LLMs). Our work confronts this challenge by turning decades of automated theorem proving research into a scalable data engine. Rather than relying on error-prone LLMs or complex proof-assistant syntax like Lean and Isabelle, our framework leverages E-prover's saturation capabilities on the vast TPTP axiom library to derive a massive, guaranteed-valid corpus of theorems. Our pipeline is principled and simple: saturate axioms, filter for "interesting" theorems, and generate tasks. With no LLMs in the loop, we eliminate factual errors by construction. This purely symbolic data is then transformed into three difficulty-controlled challenges: entailment verification, premise selection, and proof reconstruction. Our zero-shot experiments on frontier models reveal a clear weakness: performance collapses on tasks requiring deep, structural reasoning. Our framework provides both the diagnostic tool to measure this gap and a scalable source of symbolic training data to address it. We make the code and data publicly available. https://github.com/sileod/reasoning_core https://hf.co/datasets/reasoning-core/rc1

TPTP 생태계 내 LLM 수학적 추론을 위한 포화 기반 데이터셋 생성

Saturation-Driven Dataset Generation for LLM Mathematical Reasoning in the TPTP Ecosystem

초록

Support