Saturation-Driven Dataset Generation for LLM Mathematical Reasoning in the TPTP Ecosystem
September 8, 2025
Authors: Valentin Quesnel, Damien Sileo
cs.AI
Abstract
The scarcity of high-quality, logically sound data is a critical bottleneck
for advancing the mathematical reasoning of Large Language Models (LLMs). Our
work confronts this challenge by turning decades of automated theorem-proving
research into a scalable data engine. Rather than relying on error-prone LLMs
or on the complex syntax of proof assistants like Lean and Isabelle, our framework
leverages E-prover's saturation capabilities on the vast TPTP axiom library to
derive a massive, guaranteed-valid corpus of theorems. Our pipeline is
principled and simple: saturate axioms, filter for "interesting" theorems, and
generate tasks. With no LLMs in the loop, we eliminate factual errors by
construction. This purely symbolic data is then transformed into three
difficulty-controlled challenges: entailment verification, premise selection,
and proof reconstruction. Our zero-shot experiments on frontier models reveal a
clear weakness: performance collapses on tasks requiring deep, structural
reasoning. Our framework provides both the diagnostic tool to measure this gap
and a scalable source of symbolic training data to address it. We make the code
and data publicly available.
https://github.com/sileod/reasoning_core
https://hf.co/datasets/reasoning-core/rc1
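
To make the saturate-and-filter step concrete, here is a minimal Python sketch. It assumes an `eprover` binary on the PATH and a local TPTP axiom file (the path `Axioms/SET001-0.ax` is just an example); the `--print-saturated` flag, the clause regex, and the toy "interestingness" heuristic are illustrative assumptions, not the paper's exact flags or filtering criteria.

```python
# Minimal sketch of the saturation step, assuming an `eprover` binary on
# PATH. Flag spellings vary across E versions; the clause regex and the
# "interestingness" heuristic are illustrative stand-ins only.
import re
import subprocess

CNF_RE = re.compile(r"cnf\([^.]*\)\.")  # rough match for TPTP clause syntax

def saturate(axiom_file: str, cpu_limit: int = 10) -> list[str]:
    """Run E over a TPTP axiom file and collect the clauses in its
    saturated set (input axioms included; de-duplication is left out)."""
    result = subprocess.run(
        ["eprover", "--auto", f"--cpu-limit={cpu_limit}",
         "--print-saturated", axiom_file],
        capture_output=True, text=True,
    )
    return CNF_RE.findall(result.stdout)

def interesting(clause: str) -> bool:
    """Toy filter: keep short, non-degenerate clauses. The paper's actual
    notion of an 'interesting' theorem is richer than this."""
    n_literals = clause.count("|") + 1
    return n_literals <= 4 and "$false" not in clause

theorems = [c for c in saturate("Axioms/SET001-0.ax") if interesting(c)]
```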
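And a sketch of how a derived theorem might be turned into the three task types. The field names, labels, and distractor strategy here are hypothetical choices for illustration; they are not taken from the released rc1 dataset schema.

```python
# Hypothetical task builders: field names and the distractor strategy are
# illustrative only, not the rc1 dataset schema.
import random

def entailment_item(axioms: list[str], theorem: str,
                    distractors: list[str]) -> dict:
    """Entailment verification: does the candidate follow from the axioms?"""
    if random.random() < 0.5:
        return {"premises": axioms, "candidate": theorem, "label": "entailed"}
    return {"premises": axioms, "candidate": random.choice(distractors),
            "label": "not_entailed"}

def premise_selection_item(used: list[str], unused: list[str],
                           theorem: str) -> dict:
    """Premise selection: pick, from a shuffled pool, the axioms actually
    used in the theorem's derivation."""
    pool = used + unused
    random.shuffle(pool)
    return {"pool": pool, "theorem": theorem, "answer": used}

def proof_reconstruction_item(proof_steps: list[str]) -> dict:
    """Proof reconstruction: recover the derivation order from shuffled steps."""
    shuffled = proof_steps.copy()
    random.shuffle(shuffled)
    return {"steps": shuffled, "answer": proof_steps}
```

Difficulty can then be controlled by knobs such as the number of distractor premises or the proof depth, which is one natural reading of the "difficulty-controlled" framing in the abstract.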