Saturation-Driven Dataset Generation for LLM Mathematical Reasoning in the TPTP Ecosystem

September 8, 2025
Authors: Valentin Quesnel, Damien Sileo
cs.AI

Abstract

The scarcity of high-quality, logically sound data is a critical bottleneck for advancing the mathematical reasoning of Large Language Models (LLMs). Our work confronts this challenge by turning decades of automated theorem proving research into a scalable data engine. Rather than relying on error-prone LLMs or the complex syntax of proof assistants such as Lean and Isabelle, our framework leverages E-prover's saturation capabilities on the vast TPTP axiom library to derive a massive, guaranteed-valid corpus of theorems. Our pipeline is principled and simple: saturate axioms, filter for "interesting" theorems, and generate tasks. With no LLMs in the loop, we eliminate factual errors by construction. This purely symbolic data is then transformed into three difficulty-controlled challenges: entailment verification, premise selection, and proof reconstruction. Our zero-shot experiments on frontier models reveal a clear weakness: performance collapses on tasks requiring deep, structural reasoning. Our framework provides both the diagnostic tool to measure this gap and a scalable source of symbolic training data to address it. We make the code and data publicly available.

Code: https://github.com/sileod/reasoning_core
Dataset: https://hf.co/datasets/reasoning-core/rc1
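To make the saturate / filter / generate pipeline concrete, here is a minimal Python sketch. It is an illustration only: the E-prover flags, the TPTP file path, the "interestingness" heuristic, and the task schema are all assumptions, not the authors' implementation (which lives in the linked repository).

```python
"""Illustrative sketch of a saturation-driven task generator.

Assumes `eprover` (the E theorem prover) is installed and on PATH.
Flag names, the heuristic, and the schema below are assumptions for
illustration, not the pipeline from reasoning_core.
"""
import random
import subprocess


def saturate(axiom_file: str, cpu_limit: int = 10) -> list[str]:
    """Run E-prover on a TPTP axiom file and collect derived clauses.

    --print-saturated dumps the processed clause set and --auto picks a
    search strategy; exact flags may vary across E versions (assumption).
    """
    out = subprocess.run(
        ["eprover", "--auto", "--print-saturated",
         f"--cpu-limit={cpu_limit}", axiom_file],
        capture_output=True, text=True, check=False,
    ).stdout
    # Derived clauses are printed in TPTP cnf(...) form.
    return [line for line in out.splitlines() if line.startswith("cnf(")]


def interesting(clause: str) -> bool:
    """Toy stand-in for the paper's 'interesting theorem' filter:
    keep mid-sized clauses and drop refutation artifacts."""
    return 20 < len(clause) < 300 and "$false" not in clause


def entailment_item(axioms: list[str], theorems: list[str]) -> dict:
    """Package one entailment-verification example (positive case only;
    a real generator would also sample non-entailed distractors)."""
    return {
        "premises": axioms,
        "hypothesis": random.choice(theorems),
        "label": "entailed",
    }


if __name__ == "__main__":
    # SET001-0.ax is a standard TPTP set-theory axiom file (path assumed).
    theorems = [c for c in saturate("Axioms/SET001-0.ax") if interesting(c)]
    print(f"kept {len(theorems)} candidate theorems")
```

The same saturated output plausibly supports the other two tasks as well: premise-selection items can be built by asking which axioms a derived clause depends on, and proof-reconstruction items from E's recorded derivations, though the exact construction is defined in the released code and dataset.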