

EconProver: Towards More Economical Test-Time Scaling for Automated Theorem Proving

September 16, 2025
Authors: Mukai Li, Linfeng Song, Zhenwen Liang, Jiahao Xu, Shansan Gong, Qi Liu, Haitao Mi, Dong Yu
cs.AI

Abstract

Large Language Models (LLMs) have recently advanced the field of Automated Theorem Proving (ATP), attaining substantial performance gains through widely adopted test-time scaling strategies, notably reflective Chain-of-Thought (CoT) reasoning and increased sampling passes. However, both strategies introduce significant computational overhead at inference time. Moreover, existing cost analyses typically regulate only the number of sampling passes, while neglecting the substantial disparities in sampling cost introduced by different scaling strategies. In this paper, we systematically compare the efficiency of different test-time scaling strategies for ATP models and demonstrate the inefficiency of the current state-of-the-art (SOTA) open-source approaches. We then investigate approaches to significantly reduce token usage and sampling passes while maintaining the original performance. Specifically, we propose two complementary methods that can be integrated into a unified EconRL pipeline for amplified benefits: (1) a dynamic Chain-of-Thought (CoT) switching mechanism designed to mitigate unnecessary token consumption, and (2) diverse parallel-scaled reinforcement learning (RL) with trainable prefixes to enhance pass rates under constrained sampling passes. Experiments on miniF2F and ProofNet demonstrate that our EconProver achieves comparable performance to baseline methods with only 12% of the computational cost. This work provides actionable insights for deploying lightweight ATP models without sacrificing performance.
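
The abstract describes the two components only at a high level. As a rough illustration of how dynamic CoT switching and prefix-conditioned parallel sampling could be combined at inference time, here is a minimal Python sketch; the function names (generate_proof, verify_proof), the prefix strings, and the pass budgets are hypothetical assumptions for illustration, not the authors' EconRL implementation.

```python
# Minimal illustrative sketch, assuming a prover model and a proof checker are
# available as callables. All names and budgets below are hypothetical.

from typing import Callable, Optional, Sequence


def econ_style_prove(
    theorem: str,
    generate_proof: Callable[[str, str, bool], str],  # (prefix, theorem, use_cot) -> candidate proof
    verify_proof: Callable[[str, str], bool],          # (theorem, proof) -> checker verdict
    prefixes: Sequence[str] = ("", "[algebraic]", "[by_contradiction]", "[induction]"),
    direct_passes_per_prefix: int = 1,
    cot_passes_per_prefix: int = 1,
) -> Optional[str]:
    """Try cheap non-CoT sampling first; escalate to reflective CoT only on failure.

    Each pass is conditioned on a different (in the paper, trainable) prefix so that
    a limited sampling budget still covers diverse proof strategies.
    """
    # Phase 1: dynamic CoT switching -- short, non-reflective generations first.
    for prefix in prefixes:
        for _ in range(direct_passes_per_prefix):
            candidate = generate_proof(prefix, theorem, False)
            if verify_proof(theorem, candidate):
                return candidate

    # Phase 2: spend the expensive reflective-CoT token budget only when needed.
    for prefix in prefixes:
        for _ in range(cot_passes_per_prefix):
            candidate = generate_proof(prefix, theorem, True)
            if verify_proof(theorem, candidate):
                return candidate

    return None  # no verified proof within the sampling budget
```

The ordering is the point of the sketch: theorems solved by a cheap direct attempt never pay the reflective-CoT token cost, while the per-pass prefixes keep the remaining parallel samples from collapsing onto the same proof strategy.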