MMFormalizer:多模态自动形式化实战系统
MMFormalizer: Multimodal Autoformalization in the Wild
January 6, 2026
作者: Jing Xiong, Qi Han, Yunta Hsieh, Hui Shen, Huajian Xin, Chaofan Tao, Chenyang Zhao, Hengyuan Zhang, Taiqiang Wu, Zhen Zhang, Haochen Wang, Zhongwei Wan, Lingpeng Kong, Ngai Wong
cs.AI
摘要
自动形式化旨在将自然语言数学转化为形式化陈述以实现机器推理,但由于物理世界的多模态特性(如需要通过视觉元素推断质量、能量等隐藏约束),该技术在现实应用中面临根本性挑战。为此,我们提出MMFormalizer,通过将自适应实体关联与真实世界的数学物理领域相结合,将自动形式化从纯文本扩展到多模态场景。该方法通过递归关联和公理组合,从感知关联的基元出发递归构建形式化命题,其中自适应递归终止机制确保每个抽象概念都有视觉证据支持,并锚定在维度或公理基础上。我们在新基准测试集PhyX-AF(包含从MathVerse、PhyX、综合几何与解析几何中精选的115个样本)上评估MMFormalizer,覆盖多样化的多模态自动形式化任务。结果表明,GPT-5和Gemini-3-Pro等前沿模型在编译准确率与语义准确率上表现最佳,其中GPT-5在物理推理方面尤为突出,而几何领域仍是最大挑战。总体而言,MMFormalizer为统一的多模态自动形式化提供了可扩展框架,架起了感知与形式化推理的桥梁。据我们所知,这是首个能处理经典力学(源自哈密顿量)以及相对论、量子力学和热力学的多模态自动形式化方法。更多细节请访问项目页面:MMFormalizer.github.io
English
Autoformalization, which translates natural language mathematics into formal statements to enable machine reasoning, faces fundamental challenges in the wild due to the multimodal nature of the physical world, where physics requires inferring hidden constraints (e.g., mass or energy) from visual elements. To address this, we propose MMFormalizer, which extends autoformalization beyond text by integrating adaptive grounding with entities from real-world mathematical and physical domains. MMFormalizer recursively constructs formal propositions from perceptually grounded primitives through recursive grounding and axiom composition, with adaptive recursive termination ensuring that every abstraction is supported by visual evidence and anchored in dimensional or axiomatic grounding. We evaluate MMFormalizer on a new benchmark, PhyX-AF, comprising 115 curated samples from MathVerse, PhyX, Synthetic Geometry, and Analytic Geometry, covering diverse multimodal autoformalization tasks. Results show that frontier models such as GPT-5 and Gemini-3-Pro achieve the highest compile and semantic accuracy, with GPT-5 excelling in physical reasoning, while geometry remains the most challenging domain. Overall, MMFormalizer provides a scalable framework for unified multimodal autoformalization, bridging perception and formal reasoning. To the best of our knowledge, this is the first multimodal autoformalization method capable of handling classical mechanics (derived from the Hamiltonian), as well as relativity, quantum mechanics, and thermodynamics. More details are available on our project page: MMFormalizer.github.io