정리 증명을 넘어서: 형식적 문제 해결을 위한 공식화, 프레임워크 및 벤치마크

초록

표면적으로는 자명한 작업으로 보이는 문제 해결은 과학과 공학에서 중요한 구성 요소로 자리 잡아 왔다. 그러나 문제 해결 자체에 대한 일반적이면서도 구체적인 형식화는 아직 부재한 상태이다. 최근 AI 기반 문제 해결 에이전트의 발전과 함께, 프로세스 수준에서의 검증 가능성에 대한 요구가 급증하고 있으나 이에 대한 탐구는 여전히 미흡하다. 이러한 격차를 메우기 위해, 우리는 문제 해결을 결정론적 마르코프 결정 과정으로 형식화하는 원칙적인 접근법을 제시한다. 이는 기존의 FTP(형식적 정리 증명) 환경을 활용하여 프로세스 검증된 문제 해결을 수행하는 새로운 프레임워크인 FPS(형식적 문제 해결)와, 해결과 답변 검증을 분리하여 인간과의 조율을 개선한 D-FPS(연역적 FPS)를 포함한다. 이 프레임워크들의 표현력, 건전성, 완전성이 입증되었다. 우리는 문제 해결을 위한 세 가지 벤치마크를 구성하였다: MATH500 벤치마크의 일부를 형식화한 FormalMath500; FTP 벤치마크인 MiniF2F와 PutnamBench를 적응시킨 MiniF2F-Solving과 PutnamBench-Solving. 신뢰할 수 있고 해석 가능하며 인간과 조율된 평가를 위해, 우리는 형식적 검증을 통해 답변의 정확성을 판단하는 기호적 접근법인 RPE(제한된 명제 동치)를 제안한다. 우리는 네 가지 주요 FTP 모델과 두 가지 프롬프팅 방법을 기준으로 평가하여, FormalMath500의 최대 23.77%, MiniF2F-Solving의 27.47%, PutnamBench-Solving의 0.31%를 해결하였다.

English

As a seemingly self-explanatory task, problem-solving has been a significant component of science and engineering. However, a general yet concrete formulation of problem-solving itself is missing. With the recent development of AI-based problem-solving agents, the demand for process-level verifiability is rapidly increasing yet underexplored. To fill these gaps, we present a principled formulation of problem-solving as a deterministic Markov decision process; a novel framework, FPS (Formal Problem-Solving), which utilizes existing FTP (formal theorem proving) environments to perform process-verified problem-solving; and D-FPS (Deductive FPS), decoupling solving and answer verification for better human-alignment. The expressiveness, soundness and completeness of the frameworks are proven. We construct three benchmarks on problem-solving: FormalMath500, a formalization of a subset of the MATH500 benchmark; MiniF2F-Solving and PutnamBench-Solving, adaptations of FTP benchmarks MiniF2F and PutnamBench. For faithful, interpretable, and human-aligned evaluation, we propose RPE (Restricted Propositional Equivalence), a symbolic approach to determine the correctness of answers by formal verification. We evaluate four prevalent FTP models and two prompting methods as baselines, solving at most 23.77% of FormalMath500, 27.47% of MiniF2F-Solving, and 0.31% of PutnamBench-Solving.

정리 증명을 넘어서: 형식적 문제 해결을 위한 공식화, 프레임워크 및 벤치마크

Beyond Theorem Proving: Formulation, Framework and Benchmark for Formal Problem-Solving

초록

Support