炼金术：通过符号突变增强定理证明能力

摘要

即使对经验丰富的专家来说，编写形式化证明也是具有挑战性的。最近神经定理证明（NTP）领域的进展显示出加快这一过程的潜力。然而，互联网上可用的形式化语料库相对于一般文本来说是有限的，这给NTP带来了重要的数据稀缺挑战。为了解决这一问题，本研究提出了Alchemy，一个通用的数据合成框架，通过符号变异构建形式化定理。具体而言，对于Mathlib中的每个候选定理，我们确定所有可调用的定理，这些定理可以用于重写或应用于它。随后，我们通过用等价形式或前提替换陈述中的相应术语来变异候选定理。因此，我们的方法将Mathlib中的定理数量增加了一个数量级，从110k增加到6M。此外，我们对这个增广语料库进行持续的预训练和监督微调，用于大型语言模型。实验结果表明了我们方法的有效性，在Leandojo基准测试中实现了5%的绝对性能提升。此外，我们的合成数据在超出分布的miniF2F基准测试中获得了2.5%的绝对性能提升。为了提供更多见解，我们对合成数据的构成和训练范式进行了全面分析，为开发强大的定理证明器提供了有价值的指导。

English

Formal proofs are challenging to write even for experienced experts. Recent progress in Neural Theorem Proving (NTP) shows promise in expediting this process. However, the formal corpora available on the Internet are limited compared to the general text, posing a significant data scarcity challenge for NTP. To address this issue, this work proposes Alchemy, a general framework for data synthesis that constructs formal theorems through symbolic mutation. Specifically, for each candidate theorem in Mathlib, we identify all invocable theorems that can be used to rewrite or apply to it. Subsequently, we mutate the candidate theorem by replacing the corresponding term in the statement with its equivalent form or antecedent. As a result, our method increases the number of theorems in Mathlib by an order of magnitude, from 110k to 6M. Furthermore, we perform continual pretraining and supervised finetuning on this augmented corpus for large language models. Experimental results demonstrate the effectiveness of our approach, achieving a 5% absolute performance improvement on Leandojo benchmark. Additionally, our synthetic data achieve a 2.5% absolute performance gain on the out-of-distribution miniF2F benchmark. To provide further insights, we conduct a comprehensive analysis of synthetic data composition and the training paradigm, offering valuable guidance for developing a strong theorem prover.

炼金术：通过符号突变增强定理证明能力

Alchemy: Amplifying Theorem-Proving Capability through Symbolic Mutation

摘要

Support