MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning
May 15, 2025
作者: Ke Wang, Junting Pan, Linda Wei, Aojun Zhou, Weikang Shi, Zimu Lu, Han Xiao, Yunqiao Yang, Houxing Ren, Mingjie Zhan, Hongsheng Li
cs.AI
Abstract
Natural language image-caption datasets, widely used for training Large
Multimodal Models, mainly focus on natural scenarios and overlook the intricate
details of mathematical figures that are critical for problem-solving,
hindering the advancement of current LMMs in multimodal mathematical reasoning.
To this end, we propose leveraging code as supervision for cross-modal
alignment, since code inherently encodes all information needed to generate
corresponding figures, establishing a precise connection between the two
modalities. Specifically, we co-develop our image-to-code model and dataset
with a model-in-the-loop approach, resulting in FigCodifier, an image-to-code
model, and ImgCode-8.6M, the largest image-code dataset to date.
Furthermore, we utilize FigCodifier to synthesize novel mathematical figures
and then construct MM-MathInstruct-3M, a high-quality multimodal math
instruction fine-tuning dataset. Finally, we present MathCoder-VL, trained with
ImgCode-8.6M for cross-modal alignment and subsequently fine-tuned on
MM-MathInstruct-3M for multimodal math problem solving. Our model achieves a
new open-source SOTA across all six metrics. Notably, it surpasses GPT-4o and
Claude 3.5 Sonnet in the geometry problem-solving subset of MathVista,
achieving improvements of 8.9% and 9.2%, respectively. The dataset and models will be
released at https://github.com/mathllm/MathCoder.
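The key premise above is that a figure-rendering program is a lossless "caption": the code deterministically specifies every coordinate and label in the figure, so image-code pairs give an exact cross-modal alignment signal. As a minimal illustrative sketch (not the paper's actual pipeline; the function and format choices here are assumptions), a TikZ program for a labeled triangle encodes the figure completely:

```python
# Illustrative sketch: code as a precise, lossless description of a math figure.
# A TikZ program fully determines the rendered figure, so pairing figures with
# such code yields exact supervision. Names below are hypothetical, not from the paper.

def triangle_tikz(vertices, labels):
    """Emit a TikZ snippet drawing a triangle with labeled vertices."""
    lines = ["\\begin{tikzpicture}"]
    # Every coordinate is explicit in the code, unlike a natural-language caption.
    path = " -- ".join(f"({x},{y})" for x, y in vertices)
    lines.append(f"\\draw {path} -- cycle;")
    for (x, y), label in zip(vertices, labels):
        lines.append(f"\\node at ({x},{y}) {{{label}}};")
    lines.append("\\end{tikzpicture}")
    return "\n".join(lines)

# A 3-4-5 right triangle with vertices A, B, C.
code = triangle_tikz([(0, 0), (4, 0), (0, 3)], ["A", "B", "C"])
print(code)
```

An image-to-code model like FigCodifier inverts this mapping, recovering a program of this kind from the rendered figure; the (figure, code) pairs then serve as the alignment training data.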