EconProver: 자동 정리 증명을 위한 더 경제적인 테스트 타임 스케일링을 향하여

초록

대규모 언어 모델(LLMs)은 최근 자동 정리 증명(ATP) 분야를 크게 발전시켰으며, 특히 반영적 사고 연쇄(CoT) 추론과 증가된 샘플링 패스와 같은 널리 채택된 테스트 시간 확장 전략을 통해 상당한 성능 향상을 달성했습니다. 그러나 이러한 방법들은 모두 추론 과정에서 상당한 계산 오버헤드를 유발합니다. 더욱이, 기존의 비용 분석은 일반적으로 샘플링 패스의 수만을 규제하며, 다양한 확장 전략에 의해 도입된 샘플링 비용의 상당한 차이를 간과합니다. 본 논문에서는 ATP 모델을 위한 다양한 테스트 시간 확장 전략의 효율성을 체계적으로 비교하고, 현재의 최첨단(SOTA) 오픈소스 접근법의 비효율성을 입증합니다. 그런 다음, 원래의 성능을 유지하면서 토큰 사용량과 샘플링 패스를 크게 줄이는 방법을 탐구합니다. 구체적으로, 우리는 통합된 EconRL 파이프라인에 통합될 수 있는 두 가지 상호 보완적인 방법을 제안합니다: (1) 불필요한 토큰 소비를 완화하기 위해 설계된 동적 사고 연쇄(CoT) 전환 메커니즘, 그리고 (2) 제한된 샘플링 패스 하에서 패스율을 향상시키기 위한 학습 가능한 접두사를 갖춘 다양한 병렬 확장 강화 학습(RL). miniF2F와 ProofNet에서의 실험은 우리의 EconProver가 기준 방법들과 비슷한 성능을 달성하면서도 계산 비용의 12%만을 소비함을 보여줍니다. 이 연구는 성능을 희생하지 않고 경량 ATP 모델을 배포하기 위한 실행 가능한 통찰을 제공합니다.

English

Large Language Models (LLMs) have recently advanced the field of Automated Theorem Proving (ATP), attaining substantial performance gains through widely adopted test-time scaling strategies, notably reflective Chain-of-Thought (CoT) reasoning and increased sampling passes. However, they both introduce significant computational overhead for inference. Moreover, existing cost analyses typically regulate only the number of sampling passes, while neglecting the substantial disparities in sampling costs introduced by different scaling strategies. In this paper, we systematically compare the efficiency of different test-time scaling strategies for ATP models and demonstrate the inefficiency of the current state-of-the-art (SOTA) open-source approaches. We then investigate approaches to significantly reduce token usage and sample passes while maintaining the original performance. Specifically, we propose two complementary methods that can be integrated into a unified EconRL pipeline for amplified benefits: (1) a dynamic Chain-of-Thought (CoT) switching mechanism designed to mitigate unnecessary token consumption, and (2) Diverse parallel-scaled reinforcement learning (RL) with trainable prefixes to enhance pass rates under constrained sampling passes. Experiments on miniF2F and ProofNet demonstrate that our EconProver achieves comparable performance to baseline methods with only 12% of the computational cost. This work provides actionable insights for deploying lightweight ATP models without sacrificing performance.

EconProver: 자동 정리 증명을 위한 더 경제적인 테스트 타임 스케일링을 향하여

EconProver: Towards More Economical Test-Time Scaling for Automated Theorem Proving

초록

Support