MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Math Reasoning

Abstract

Natural language image-caption datasets, widely used for training Large Multimodal Models (LMMs), focus mainly on natural scenes and overlook the intricate details of mathematical figures that are critical for problem solving. This gap hinders the progress of current LMMs in multimodal mathematical reasoning. To address it, we propose using code as supervision for cross-modal alignment: because code inherently encodes all the information needed to generate the corresponding figure, it establishes a precise connection between the two modalities. Specifically, we co-develop our image-to-code model and dataset with a model-in-the-loop approach, resulting in an image-to-code model, FigCodifier, and the ImgCode-8.6M dataset, the largest image-code dataset to date. Furthermore, we use FigCodifier to synthesize novel mathematical figures and then construct MM-MathInstruct-3M, a high-quality multimodal math instruction fine-tuning dataset. Finally, we present MathCoder-VL, which is first trained on ImgCode-8.6M for cross-modal alignment and subsequently fine-tuned on MM-MathInstruct-3M for multimodal math problem solving. Our model achieves a new open-source SOTA across all six metrics. Notably, on the geometry problem-solving subset of MathVista, it surpasses GPT-4o and Claude 3.5 Sonnet by 8.9% and 9.2%, respectively. The dataset and models will be released at https://github.com/mathllm/MathCoder.
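
The premise that code carries every detail needed to render a figure can be illustrated with a minimal sketch. The example below is hypothetical and is not drawn from ImgCode-8.6M; SVG is an assumption chosen only to keep the example dependency-free, and the function names `triangle_svg` and `parse_vertices` are invented for illustration. It shows that the exact vertex coordinates and labels of a simple geometric figure can be recovered from its source, which a free-form caption would not guarantee.

```python
# Illustrative only: a hypothetical image-code pair, not taken from ImgCode-8.6M.
from xml.etree import ElementTree as ET


def triangle_svg(vertices, labels):
    """Return SVG source for a labeled triangle (the 'code' half of a pair)."""
    svg = ET.Element("svg", width="200", height="200")
    points = " ".join(f"{x},{y}" for x, y in vertices)
    ET.SubElement(svg, "polygon", points=points, fill="none", stroke="black")
    for (x, y), name in zip(vertices, labels):
        text = ET.SubElement(svg, "text", x=str(x), y=str(y))
        text.text = name
    return ET.tostring(svg, encoding="unicode")


def parse_vertices(svg_source):
    """Recover the exact vertex coordinates from the figure's source."""
    root = ET.fromstring(svg_source)
    for element in root.iter():
        if element.tag.endswith("polygon"):
            pairs = element.get("points").split()
            return [tuple(int(v) for v in pair.split(",")) for pair in pairs]
    return []


source = triangle_svg([(20, 180), (180, 180), (20, 60)], ["A", "B", "C"])
recovered = parse_vertices(source)
```

Here `recovered` matches the input coordinates exactly, which is the sense in which code gives a precise link between the visual and textual modalities.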