EconProver: Towards More Economical Test-Time Scaling for Automated Theorem Proving
September 16, 2025
Authors: Mukai Li, Linfeng Song, Zhenwen Liang, Jiahao Xu, Shansan Gong, Qi Liu, Haitao Mi, Dong Yu
cs.AI
Abstract
Large Language Models (LLMs) have recently advanced the field of Automated
Theorem Proving (ATP), attaining substantial performance gains through widely
adopted test-time scaling strategies, notably reflective Chain-of-Thought (CoT)
reasoning and increased sampling passes. However, both strategies introduce
significant computational overhead at inference. Moreover, existing cost
analyses typically regulate only the number of sampling passes, while
neglecting the substantial disparities in sampling costs introduced by
different scaling strategies. In this paper, we systematically compare the
efficiency of different test-time scaling strategies for ATP models and
demonstrate the inefficiency of the current state-of-the-art (SOTA) open-source
approaches. We then investigate approaches to significantly reduce token usage
and sampling passes while maintaining the original performance. Specifically, we
propose two complementary methods that can be integrated into a unified EconRL
pipeline for amplified benefits: (1) a dynamic Chain-of-Thought (CoT) switching
mechanism designed to mitigate unnecessary token consumption, and (2) diverse
parallel-scaled reinforcement learning (RL) with trainable prefixes to enhance
pass rates under constrained sampling passes. Experiments on miniF2F and
ProofNet demonstrate that our EconProver achieves comparable performance to
baseline methods with only 12% of the computational cost. This work provides
actionable insights for deploying lightweight ATP models without sacrificing
performance.
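The dynamic CoT switching idea can be illustrated with a minimal sketch. This is not the paper's implementation: the function names (`prove_with_budget`, `attempt_proof`), the confidence-based escalation rule, and the thresholds are all hypothetical, chosen only to show how cheap direct decoding can be tried first and expensive reflective CoT decoding engaged only when needed, within a fixed sampling budget.

```python
# Hypothetical sketch of a dynamic CoT switching policy for an ATP model.
# Direct (non-CoT) passes are tried first; the prover escalates to reflective
# CoT mode only when a direct attempt reports low confidence. All names and
# thresholds are illustrative assumptions, not EconProver's actual code.

def prove_with_budget(theorem, attempt_proof, cot_threshold=0.5, max_passes=4):
    """Return (proof_or_None, mode_used, passes_used) under a total
    budget of max_passes sampling passes.

    attempt_proof(theorem, use_cot) is assumed to return a pair
    (proof or None, confidence in [0, 1]).
    """
    passes = 0
    # Phase 1: cheap direct attempts without chain-of-thought tokens.
    while passes < max_passes:
        passes += 1
        proof, confidence = attempt_proof(theorem, use_cot=False)
        if proof is not None:
            return proof, "direct", passes
        if confidence < cot_threshold:
            break  # model is unsure: switch to reflective CoT
    # Phase 2: spend the remaining budget on expensive CoT attempts.
    while passes < max_passes:
        passes += 1
        proof, _ = attempt_proof(theorem, use_cot=True)
        if proof is not None:
            return proof, "cot", passes
    return None, "cot", passes


# Toy stub standing in for an LLM prover: direct mode fails with low
# confidence, CoT mode succeeds.
def stub_prover(theorem, use_cot):
    if use_cot:
        return "dummy-proof", 0.9
    return None, 0.2


result = prove_with_budget("theorem_stub", stub_prover)
print(result)  # ('dummy-proof', 'cot', 2): one direct pass, then one CoT pass
```

The point of the sketch is the budget accounting: easy statements resolve in direct mode without spending any CoT tokens, while hard ones consume the remaining passes in CoT mode, which is how token usage can drop without reducing the overall pass rate.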