

miniF2F-Lean Revisited: Reviewing Limitations and Charting a Path Forward

November 5, 2025
Authors: Azim Ospanov, Farzan Farnia, Roozbeh Yousefzadeh
cs.AI

Abstract

We perform a thorough analysis of the formal and informal statements in the miniF2F benchmark from the perspective of an AI system tasked with participating in a math Olympiad consisting of the problems in miniF2F. In this setting, the model must read and comprehend each problem in natural language, formalize it in the Lean language, and then prove it; it receives credit for a problem only if the formal proof corresponds to the original informal statement presented to the model. Our evaluation shows that the best accuracy of such a pipeline is about 36% using the SoTA models in the literature, considerably lower than the individual SoTA accuracies of 97% and 69% reported in the autoformalization and theorem-proving literature, respectively. Analyzing the failure modes, we trace a considerable portion of this drop to discrepancies between the formal and informal statements for more than half of the problems in miniF2F. We correct all the errors, discrepancies, and simplifications in the formal and informal statements, and present miniF2F-v2, with fully verified formal and informal statements and proofs. Evaluating the full theorem-proving pipeline on miniF2F-v2 yields a best accuracy of 70%, a significant improvement over the 40% on the original miniF2F, yet one that still indicates considerable misalignment between autoformalization models and theorem provers. Our deep analysis suggests that a higher-quality benchmark can help the community better evaluate progress in the field of formal reasoning and better diagnose the failure and success modes of autoformalization and theorem-proving models. Our dataset is available at https://github.com/roozbeh-yz/miniF2F_v2.
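To make the informal-to-formal pipeline concrete, the sketch below shows a hypothetical Olympiad-style informal statement together with a faithful and an unfaithful Lean formalization; it is not drawn from miniF2F, and the theorem names, the Lean 4 / Mathlib setting, and the proof tactics are illustrative assumptions rather than the paper's own examples.

```lean
import Mathlib

/- Hypothetical illustration (not from miniF2F).
   Informal statement: "Show that for all real numbers a and b, a² + b² ≥ 2ab." -/

-- A faithful formalization: it quantifies over all real numbers, matching the informal text.
theorem faithful (a b : ℝ) : a ^ 2 + b ^ 2 ≥ 2 * a * b := by
  nlinarith [sq_nonneg (a - b)]

-- An unfaithful formalization: it is provable, but it only checks a single instance,
-- so a proof of it should earn no credit for the original informal problem.
theorem unfaithful : (1 : ℝ) ^ 2 + (2 : ℝ) ^ 2 ≥ 2 * 1 * 2 := by
  norm_num
```

Discrepancies of this kind, where a formal statement type-checks and is provable yet does not capture the original informal claim, are the failure mode the paper traces in more than half of the miniF2F problems.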