

miniF2F-Lean Revisited: Reviewing Limitations and Charting a Path Forward

November 5, 2025
Authors: Azim Ospanov, Farzan Farnia, Roozbeh Yousefzadeh
cs.AI

Abstract

We perform a thorough analysis of the formal and informal statements in the miniF2F benchmark from the perspective of an AI system tasked with participating in a math Olympiad consisting of the problems in miniF2F. In such a setting, the model has to read and comprehend the problems in natural language, formalize them in the Lean language, and then proceed to prove them; it receives credit for a problem only if the formal proof corresponds to the original informal statement presented to the model. Our evaluation results reveal that the best accuracy of such a pipeline is about 36% using the SoTA models in the literature, considerably lower than the individual SoTA accuracies of 97% and 69% reported in the autoformalization and theorem-proving literature, respectively. Analyzing the failure modes, we trace a considerable portion of this drop back to discrepancies between the formal and informal statements for more than half of the problems in miniF2F. We proceed to correct all errors, discrepancies, and simplifications in the formal and informal statements, and present miniF2F-v2 with fully verified formal and informal statements and proofs. Evaluating the full theorem-proving pipeline on miniF2F-v2 yields a best accuracy of 70%, a significant improvement over the 40% on the original miniF2F, yet one that still indicates considerable misalignment between autoformalization models and theorem provers. Our deep analysis suggests that a higher-quality benchmark can help the community better evaluate progress in the field of formal reasoning and better diagnose the failure and success modes of autoformalization and theorem-proving models. Our dataset is available at https://github.com/roozbeh-yz/miniF2F_v2.
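
To illustrate what the informal-to-formal step of the pipeline must preserve, below is a minimal Lean 4 sketch built around a toy problem invented for this illustration; it is not drawn from miniF2F, and the theorem names and statements are assumptions. The second theorem shows the kind of statement drift (a dropped hypothesis and a weakened inequality) that the abstract identifies as a common source of mismatch between formal and informal statements.

-- Hypothetical toy example (not from miniF2F), contrasting a faithful and a
-- drifted formalization of the same informal statement.

-- Informal statement: "Show that for every positive integer n, n + 1 > 1."

-- A faithful formalization keeps the positivity hypothesis and the strict inequality.
theorem faithful_formalization (n : Nat) (h : 0 < n) : 1 < n + 1 :=
  Nat.succ_lt_succ h

-- A drifted formalization silently drops the hypothesis and weakens the goal;
-- it is still provable, but it no longer corresponds to the informal problem,
-- so a proof of it should not earn credit in the Olympiad setting described above.
theorem drifted_formalization (n : Nat) : 1 ≤ n + 1 :=
  Nat.succ_le_succ (Nat.zero_le n)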