超越定理证明:形式化问题求解的构建、框架与基准
Beyond Theorem Proving: Formulation, Framework and Benchmark for Formal Problem-Solving
May 7, 2025
作者: Qi Liu, Xinhao Zheng, Renqiu Xia, Xingzhi Qi, Qinxiang Cao, Junchi Yan
cs.AI
摘要
作为一项看似不言自明的任务,问题解决一直是科学与工程领域的重要组成部分。然而,关于问题解决本身,一个普遍且具体的定义却尚付阙如。随着基于人工智能的问题解决代理的近期发展,对过程层面可验证性的需求迅速增长,但这一领域仍未被充分探索。为填补这些空白,我们提出了一种将问题解决形式化为确定性马尔可夫决策过程的原理性框架;一个名为FPS(形式化问题解决)的新颖框架,它利用现有的FTP(形式化定理证明)环境来执行过程验证的问题解决;以及D-FPS(演绎式FPS),它将求解与答案验证解耦,以实现更好的人机对齐。我们证明了这些框架的表达力、可靠性和完备性。我们构建了三个关于问题解决的基准测试:FormalMath500,即MATH500基准测试子集的形式化版本;MiniF2F-Solving和PutnamBench-Solving,分别是FTP基准测试MiniF2F和PutnamBench的改编版本。为了进行忠实、可解释且人机对齐的评估,我们提出了RPE(受限命题等价性),这是一种通过形式验证来确定答案正确性的符号方法。我们评估了四种流行的FTP模型和两种提示方法作为基线,分别解决了FormalMath500最多23.77%的问题,MiniF2F-Solving的27.47%,以及PutnamBench-Solving的0.31%。
English
As a seemingly self-explanatory task, problem-solving has been a significant
component of science and engineering. However, a general yet concrete
formulation of problem-solving itself is missing. With the recent development
of AI-based problem-solving agents, the demand for process-level verifiability
is rapidly increasing yet underexplored. To fill these gaps, we present a
principled formulation of problem-solving as a deterministic Markov decision
process; a novel framework, FPS (Formal Problem-Solving), which utilizes
existing FTP (formal theorem proving) environments to perform process-verified
problem-solving; and D-FPS (Deductive FPS), decoupling solving and answer
verification for better human-alignment. The expressiveness, soundness and
completeness of the frameworks are proven. We construct three benchmarks on
problem-solving: FormalMath500, a formalization of a subset of the MATH500
benchmark; MiniF2F-Solving and PutnamBench-Solving, adaptations of FTP
benchmarks MiniF2F and PutnamBench. For faithful, interpretable, and
human-aligned evaluation, we propose RPE (Restricted Propositional
Equivalence), a symbolic approach to determine the correctness of answers by
formal verification. We evaluate four prevalent FTP models and two prompting
methods as baselines, solving at most 23.77% of FormalMath500, 27.47% of
MiniF2F-Solving, and 0.31% of PutnamBench-Solving.Summary
AI-Generated Summary