MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning

May 15, 2025
Authors: Ke Wang, Junting Pan, Linda Wei, Aojun Zhou, Weikang Shi, Zimu Lu, Han Xiao, Yunqiao Yang, Houxing Ren, Mingjie Zhan, Hongsheng Li
cs.AI

Abstract

Natural language image-caption datasets, widely used for training Large Multimodal Models (LMMs), focus mainly on natural scenes and overlook the intricate details of mathematical figures that are critical for problem-solving. This hinders the progress of current LMMs in multimodal mathematical reasoning. To address this, we propose using code as supervision for cross-modal alignment: code inherently encodes all the information needed to generate the corresponding figure, so it establishes a precise connection between the two modalities. Specifically, we co-develop our image-to-code model and dataset with a model-in-the-loop approach, resulting in an image-to-code model, FigCodifier, and the ImgCode-8.6M dataset, the largest image-code dataset to date. Furthermore, we use FigCodifier to synthesize novel mathematical figures and then construct MM-MathInstruct-3M, a high-quality multimodal math instruction fine-tuning dataset. Finally, we present MathCoder-VL, which is first trained on ImgCode-8.6M for cross-modal alignment and then fine-tuned on MM-MathInstruct-3M for multimodal math problem solving. Our model achieves a new open-source SOTA across all six metrics. Notably, it surpasses GPT-4o and Claude 3.5 Sonnet on the geometry problem-solving subset of MathVista, with improvements of 8.9% and 9.2%, respectively. The dataset and models will be released at https://github.com/mathllm/MathCoder.
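
The claim that code encodes everything needed to generate a figure can be illustrated with a minimal sketch. It is not taken from the paper: the `Triangle` spec, `to_tikz`, and `side_lengths` are hypothetical names, and TikZ is used only as an example figure language, since the abstract does not state which languages ImgCode-8.6M contains. The point is that the code side of an image-code pair pins down exact coordinates, from which quantities a caption would typically omit (such as side lengths) can be recovered.

```python
import math
from dataclasses import dataclass
from typing import Tuple

Point = Tuple[float, float]


@dataclass(frozen=True)
class Triangle:
    """Hypothetical figure spec: three vertices in the plane."""
    a: Point
    b: Point
    c: Point


def to_tikz(t: Triangle) -> str:
    """Emit TikZ source that fully determines the drawn triangle."""
    path = " -- ".join(f"({x:g},{y:g})" for x, y in (t.a, t.b, t.c))
    return (
        "\\begin{tikzpicture}\n"
        f"  \\draw {path} -- cycle;\n"
        "\\end{tikzpicture}"
    )


def side_lengths(t: Triangle) -> Tuple[float, float, float]:
    """Lengths of AB, BC, CA, computed from the same coordinates the code draws."""
    return (math.dist(t.a, t.b), math.dist(t.b, t.c), math.dist(t.c, t.a))


# The "code" half of an illustrative image-code pair for a 3-4-5 right triangle.
tri = Triangle((0, 0), (4, 0), (0, 3))
tikz_code = to_tikz(tri)
lengths = side_lengths(tri)

print(tikz_code)
print(lengths)
```

A caption such as "a right triangle" would be consistent with infinitely many figures, whereas the emitted code corresponds to exactly one, which is the kind of precision the abstract attributes to code-based supervision.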