ChatPaper.aiChatPaper

超越定理證明:形式化問題解決的構建、框架與基準

Beyond Theorem Proving: Formulation, Framework and Benchmark for Formal Problem-Solving

May 7, 2025
作者: Qi Liu, Xinhao Zheng, Renqiu Xia, Xingzhi Qi, Qinxiang Cao, Junchi Yan
cs.AI

摘要

作為一項看似不言自明的任務,問題解決一直是科學與工程領域的重要組成部分。然而,對於問題解決本身,尚缺乏一個通用且具體的定義。隨著基於人工智能的問題解決代理的近期發展,對過程層面可驗證性的需求迅速增長,但這一領域仍未被充分探索。為填補這些空白,我們提出了一種基於確定性馬爾可夫決策過程的問題解決原則性定義;一個新穎的框架——FPS(形式化問題解決),該框架利用現有的FTP(形式化定理證明)環境來執行過程驗證的問題解決;以及D-FPS(演繹式FPS),它將解決過程與答案驗證分離,以實現更好的人類對齊。我們證明了這些框架的表達能力、健全性和完備性。我們構建了三個關於問題解決的基準測試:FormalMath500,它是MATH500基準測試子集的形式化版本;MiniF2F-Solving和PutnamBench-Solving,這兩個是FTP基準測試MiniF2F和PutnamBench的改編版本。為了進行忠實、可解釋且與人類對齊的評估,我們提出了RPE(受限命題等價性),這是一種通過形式化驗證來確定答案正確性的符號化方法。我們評估了四種流行的FTP模型和兩種提示方法作為基線,最多解決了FormalMath500的23.77%、MiniF2F-Solving的27.47%以及PutnamBench-Solving的0.31%。
English
As a seemingly self-explanatory task, problem-solving has been a significant component of science and engineering. However, a general yet concrete formulation of problem-solving itself is missing. With the recent development of AI-based problem-solving agents, the demand for process-level verifiability is rapidly increasing yet underexplored. To fill these gaps, we present a principled formulation of problem-solving as a deterministic Markov decision process; a novel framework, FPS (Formal Problem-Solving), which utilizes existing FTP (formal theorem proving) environments to perform process-verified problem-solving; and D-FPS (Deductive FPS), decoupling solving and answer verification for better human-alignment. The expressiveness, soundness and completeness of the frameworks are proven. We construct three benchmarks on problem-solving: FormalMath500, a formalization of a subset of the MATH500 benchmark; MiniF2F-Solving and PutnamBench-Solving, adaptations of FTP benchmarks MiniF2F and PutnamBench. For faithful, interpretable, and human-aligned evaluation, we propose RPE (Restricted Propositional Equivalence), a symbolic approach to determine the correctness of answers by formal verification. We evaluate four prevalent FTP models and two prompting methods as baselines, solving at most 23.77% of FormalMath500, 27.47% of MiniF2F-Solving, and 0.31% of PutnamBench-Solving.

Summary

AI-Generated Summary

PDF101May 8, 2025