EconProver: 自動定理証明のためのより経済的なテスト時スケーリングに向けて

要旨

大規模言語モデル（LLM）は最近、自動定理証明（ATP）の分野を大きく進展させ、広く採用されているテスト時のスケーリング戦略、特に反射的Chain-of-Thought（CoT）推論とサンプリングパスの増加を通じて、大幅な性能向上を達成しています。しかし、これらの手法はどちらも推論時に大きな計算コストを伴います。さらに、既存のコスト分析は通常、サンプリングパスの数のみを規制しており、異なるスケーリング戦略によって導入されるサンプリングコストの大きな差異を無視しています。本論文では、ATPモデルのための異なるテスト時スケーリング戦略の効率を体系的に比較し、現在の最先端（SOTA）オープンソース手法の非効率性を実証します。その後、元の性能を維持しながらトークン使用量とサンプリングパスを大幅に削減するアプローチを調査します。具体的には、統合されたEconRLパイプラインに組み込むことで相乗効果を発揮する2つの補完的な手法を提案します：（1）不必要なトークン消費を軽減するための動的Chain-of-Thought（CoT）切り替えメカニズム、（2）制約付きサンプリングパス下でのパス率を向上させるための学習可能なプレフィックスを備えた多様な並列スケーリング強化学習（RL）。miniF2FとProofNetでの実験により、我々のEconProverがベースライン手法と同等の性能をわずか12%の計算コストで達成できることを示します。この研究は、性能を犠牲にすることなく軽量なATPモデルを展開するための実践的な洞察を提供します。

English

Large Language Models (LLMs) have recently advanced the field of Automated Theorem Proving (ATP), attaining substantial performance gains through widely adopted test-time scaling strategies, notably reflective Chain-of-Thought (CoT) reasoning and increased sampling passes. However, they both introduce significant computational overhead for inference. Moreover, existing cost analyses typically regulate only the number of sampling passes, while neglecting the substantial disparities in sampling costs introduced by different scaling strategies. In this paper, we systematically compare the efficiency of different test-time scaling strategies for ATP models and demonstrate the inefficiency of the current state-of-the-art (SOTA) open-source approaches. We then investigate approaches to significantly reduce token usage and sample passes while maintaining the original performance. Specifically, we propose two complementary methods that can be integrated into a unified EconRL pipeline for amplified benefits: (1) a dynamic Chain-of-Thought (CoT) switching mechanism designed to mitigate unnecessary token consumption, and (2) Diverse parallel-scaled reinforcement learning (RL) with trainable prefixes to enhance pass rates under constrained sampling passes. Experiments on miniF2F and ProofNet demonstrate that our EconProver achieves comparable performance to baseline methods with only 12% of the computational cost. This work provides actionable insights for deploying lightweight ATP models without sacrificing performance.

EconProver: 自動定理証明のためのより経済的なテスト時スケーリングに向けて

EconProver: Towards More Economical Test-Time Scaling for Automated Theorem Proving

要旨

Support