MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning

May 15, 2025
Authors: Ke Wang, Junting Pan, Linda Wei, Aojun Zhou, Weikang Shi, Zimu Lu, Han Xiao, Yunqiao Yang, Houxing Ren, Mingjie Zhan, Hongsheng Li
cs.AI

Abstract

Natural language image-caption datasets, widely used for training Large Multimodal Models (LMMs), mainly focus on natural scenarios and overlook the intricate details of mathematical figures that are critical for problem-solving, which hinders the progress of current LMMs in multimodal mathematical reasoning. To this end, we propose leveraging code as supervision for cross-modal alignment, since code inherently encodes all the information needed to generate the corresponding figures and thus establishes a precise connection between the two modalities. Specifically, we co-develop our image-to-code model and dataset with a model-in-the-loop approach, resulting in an image-to-code model, FigCodifier, and the ImgCode-8.6M dataset, the largest image-code dataset to date. Furthermore, we use FigCodifier to synthesize novel mathematical figures and then construct MM-MathInstruct-3M, a high-quality multimodal math instruction fine-tuning dataset. Finally, we present MathCoder-VL, which is first trained on ImgCode-8.6M for cross-modal alignment and subsequently fine-tuned on MM-MathInstruct-3M for multimodal math problem solving. Our model achieves a new open-source SOTA across all six metrics. Notably, it surpasses GPT-4o and Claude 3.5 Sonnet on the geometry problem-solving subset of MathVista, with improvements of 8.9% and 9.2%, respectively. The dataset and models will be released at https://github.com/mathllm/MathCoder.
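
The abstract does not spell out how the model-in-the-loop co-development works, so the following is only a generic sketch of that kind of loop, not the authors' pipeline. It assumes that the current image-to-code model labels new images, that generated code is kept only if it passes some validity check, and that the model is retrained on the enlarged dataset. Every name here (`Pair`, `model_in_the_loop`, `toy_train`, `toy_renders_ok`) is hypothetical, and the "model" is a toy stand-in rather than a neural network.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Pair:
    image: str  # stand-in for an image (e.g. a file path); hypothetical
    code: str   # code that is meant to reproduce the image


Model = Callable[[str], str]  # maps an image to generated code


def model_in_the_loop(
    seed_pairs: List[Pair],
    unlabeled_images: List[str],
    train: Callable[[List[Pair]], Model],
    renders_ok: Callable[[str], bool],
    rounds: int = 2,
) -> Tuple[Model, List[Pair]]:
    """Grow an image-code dataset and an image-to-code model together."""
    dataset = list(seed_pairs)
    model = train(dataset)
    for _ in range(rounds):
        seen = {pair.image for pair in dataset}
        for image in unlabeled_images:
            if image in seen:
                continue
            code = model(image)
            # Assumed filter: keep only code that passes a validity check.
            if renders_ok(code):
                dataset.append(Pair(image, code))
        # Retrain on the enlarged dataset before the next round.
        model = train(dataset)
    return model, dataset


# --- Toy stand-ins so the sketch runs end to end (all hypothetical) ---

def toy_train(pairs: List[Pair]) -> Model:
    skill = len(pairs)  # pretend that more data yields a more capable model

    def model(image: str) -> str:
        difficulty = int(image.split("_")[1])
        return f"draw({image})" if difficulty <= skill else "syntax error("

    return model


def toy_renders_ok(code: str) -> bool:
    # Stand-in for actually executing or compiling the generated code.
    return code.startswith("draw(") and code.endswith(")")


if __name__ == "__main__":
    seed = [Pair("fig_0", "draw(fig_0)")]
    pool = ["fig_1", "fig_2", "fig_3", "fig_5"]
    _, data = model_in_the_loop(seed, pool, toy_train, toy_renders_ok)
    print([pair.image for pair in data])
```

In this toy setup each round admits only the images the current model can already handle, so the dataset and the model improve in alternation; a real system would replace the validity check with rendering the generated code and would likely apply further quality filters.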