MMFormalizer:多模態自動形式化實境應用
MMFormalizer: Multimodal Autoformalization in the Wild
January 6, 2026
作者: Jing Xiong, Qi Han, Yunta Hsieh, Hui Shen, Huajian Xin, Chaofan Tao, Chenyang Zhao, Hengyuan Zhang, Taiqiang Wu, Zhen Zhang, Haochen Wang, Zhongwei Wan, Lingpeng Kong, Ngai Wong
cs.AI
摘要
自動形式化技術旨在將自然語言數學轉換為形式化陳述以實現機器推理,但在現實世界中因物理世界的多模態特性面臨根本性挑戰——物理學常需從視覺元素推斷隱含約束(如質量或能量)。為此,我們提出MMFormalizer,通過整合現實世界數學與物理領域的實體自適應接地技術,將自動形式化擴展至文本之外。該方法通過遞歸接地與公理組合,從感知接地的基元遞歸構建形式化命題,並以自適應遞歸終止機制確保每個抽象概念均獲得視覺證據支持,並錨定於維度或公理基礎。我們在新型基準測試集PhyX-AF上評估MMFormalizer,該數據集包含從MathVerse、PhyX、合成幾何與解析幾何中精選的115個樣本,涵蓋多樣化多模態自動形式化任務。結果表明,GPT-5和Gemini-3-Pro等前沿模型在編譯與語義準確性上表現最佳,其中GPT-5在物理推理方面尤為突出,而幾何領域仍是最具挑戰性的方向。總體而言,MMFormalizer為統一多模態自動形式化提供了可擴展框架,橋接了感知與形式化推理。據我們所知,這是首個能處理經典力學(源自哈密頓量)以及相對論、量子力學和熱力學的多模態自動形式化方法。更多細節請訪問項目頁面:MMFormalizer.github.io
English
Autoformalization, which translates natural language mathematics into formal statements to enable machine reasoning, faces fundamental challenges in the wild due to the multimodal nature of the physical world, where physics requires inferring hidden constraints (e.g., mass or energy) from visual elements. To address this, we propose MMFormalizer, which extends autoformalization beyond text by integrating adaptive grounding with entities from real-world mathematical and physical domains. MMFormalizer recursively constructs formal propositions from perceptually grounded primitives through recursive grounding and axiom composition, with adaptive recursive termination ensuring that every abstraction is supported by visual evidence and anchored in dimensional or axiomatic grounding. We evaluate MMFormalizer on a new benchmark, PhyX-AF, comprising 115 curated samples from MathVerse, PhyX, Synthetic Geometry, and Analytic Geometry, covering diverse multimodal autoformalization tasks. Results show that frontier models such as GPT-5 and Gemini-3-Pro achieve the highest compile and semantic accuracy, with GPT-5 excelling in physical reasoning, while geometry remains the most challenging domain. Overall, MMFormalizer provides a scalable framework for unified multimodal autoformalization, bridging perception and formal reasoning. To the best of our knowledge, this is the first multimodal autoformalization method capable of handling classical mechanics (derived from the Hamiltonian), as well as relativity, quantum mechanics, and thermodynamics. More details are available on our project page: MMFormalizer.github.io