EconProver: Op Weg naar Economischere Testtijd Schaalbaarheid voor Automatische Stellingbewijzing

Samenvatting

Grote Taalmodellen (LLMs) hebben recentelijk het veld van Automatisch Theorema Bewijzen (ATP) vooruitgebracht, waarbij aanzienlijke prestatieverbeteringen zijn behaald door veelgebruikte schaalstrategieën tijdens het testen, met name reflectieve Chain-of-Thought (CoT) redenering en een verhoogd aantal sampling passes. Beide methoden introduceren echter aanzienlijke rekenkundige overhead voor inferentie. Bovendien reguleren bestaande kostenanalyses doorgaans alleen het aantal sampling passes, terwijl de aanzienlijke verschillen in samplingkosten die door verschillende schaalstrategieën worden geïntroduceerd, worden verwaarloosd. In dit artikel vergelijken we systematisch de efficiëntie van verschillende schaalstrategieën tijdens het testen voor ATP-modellen en tonen we de inefficiëntie aan van de huidige state-of-the-art (SOTA) open-source benaderingen. Vervolgens onderzoeken we benaderingen om het tokengebruik en het aantal sampling passes aanzienlijk te verminderen terwijl de oorspronkelijke prestaties behouden blijven. Specifiek stellen we twee complementaire methoden voor die kunnen worden geïntegreerd in een uniforme EconRL-pipeline voor versterkte voordelen: (1) een dynamisch Chain-of-Thought (CoT) schakelmechanisme ontworpen om onnodig tokenverbruik te verminderen, en (2) Diverse parallel-geschaalde reinforcement learning (RL) met trainbare prefixen om de slaagpercentages te verbeteren onder beperkte sampling passes. Experimenten op miniF2F en ProofNet demonstreren dat onze EconProver vergelijkbare prestaties behaalt als baseline-methoden met slechts 12% van de rekenkundige kosten. Dit werk biedt praktische inzichten voor het implementeren van lichtgewicht ATP-modellen zonder in te leveren op prestaties.

English

Large Language Models (LLMs) have recently advanced the field of Automated Theorem Proving (ATP), attaining substantial performance gains through widely adopted test-time scaling strategies, notably reflective Chain-of-Thought (CoT) reasoning and increased sampling passes. However, they both introduce significant computational overhead for inference. Moreover, existing cost analyses typically regulate only the number of sampling passes, while neglecting the substantial disparities in sampling costs introduced by different scaling strategies. In this paper, we systematically compare the efficiency of different test-time scaling strategies for ATP models and demonstrate the inefficiency of the current state-of-the-art (SOTA) open-source approaches. We then investigate approaches to significantly reduce token usage and sample passes while maintaining the original performance. Specifically, we propose two complementary methods that can be integrated into a unified EconRL pipeline for amplified benefits: (1) a dynamic Chain-of-Thought (CoT) switching mechanism designed to mitigate unnecessary token consumption, and (2) Diverse parallel-scaled reinforcement learning (RL) with trainable prefixes to enhance pass rates under constrained sampling passes. Experiments on miniF2F and ProofNet demonstrate that our EconProver achieves comparable performance to baseline methods with only 12% of the computational cost. This work provides actionable insights for deploying lightweight ATP models without sacrificing performance.

EconProver: Op Weg naar Economischere Testtijd Schaalbaarheid voor Automatische Stellingbewijzing

EconProver: Towards More Economical Test-Time Scaling for Automated Theorem Proving

Samenvatting

Support