MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning
May 15, 2025
作者: Ke Wang, Junting Pan, Linda Wei, Aojun Zhou, Weikang Shi, Zimu Lu, Han Xiao, Yunqiao Yang, Houxing Ren, Mingjie Zhan, Hongsheng Li
cs.AI
Abstract
Natural language image-caption datasets, widely used for training Large
Multimodal Models, mainly focus on natural scenarios and overlook the intricate
details of mathematical figures that are critical for problem-solving,
hindering the advancement of current LMMs in multimodal mathematical reasoning.
To this end, we propose leveraging code as supervision for cross-modal
alignment, since code inherently encodes all information needed to generate
corresponding figures, establishing a precise connection between the two
modalities. Specifically, we co-develop our image-to-code model and dataset
with a model-in-the-loop approach, resulting in FigCodifier, an image-to-code
model, and ImgCode-8.6M, the largest image-code dataset to date.
Furthermore, we utilize FigCodifier to synthesize novel mathematical figures
and then construct MM-MathInstruct-3M, a high-quality multimodal math
instruction fine-tuning dataset. Finally, we present MathCoder-VL, trained with
ImgCode-8.6M for cross-modal alignment and subsequently fine-tuned on
MM-MathInstruct-3M for multimodal math problem solving. Our model achieves a
new open-source SOTA across all six metrics. Notably, it surpasses GPT-4o and
Claude 3.5 Sonnet in the geometry problem-solving subset of MathVista,
achieving improvements of 8.9% and 9.2%, respectively. The dataset and models will be
released at https://github.com/mathllm/MathCoder.
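The key premise above is that a figure-rendering program is a lossless "caption": the code deterministically specifies every coordinate and label in the figure, so image-code pairs give an exact cross-modal alignment signal. As a minimal illustrative sketch (not the paper's actual pipeline; the function and format choices here are assumptions), a TikZ program for a labeled triangle encodes the figure completely:

```python
# Illustrative sketch: code as a precise, lossless description of a math figure.
# A TikZ program fully determines the rendered figure, so pairing figures with
# such code yields exact supervision. Names below are hypothetical, not from the paper.

def triangle_tikz(vertices, labels):
    """Emit a TikZ snippet drawing a triangle with labeled vertices."""
    lines = ["\\begin{tikzpicture}"]
    # Every coordinate is explicit in the code, unlike a natural-language caption.
    path = " -- ".join(f"({x},{y})" for x, y in vertices)
    lines.append(f"\\draw {path} -- cycle;")
    for (x, y), label in zip(vertices, labels):
        lines.append(f"\\node at ({x},{y}) {{{label}}};")
    lines.append("\\end{tikzpicture}")
    return "\n".join(lines)

# A 3-4-5 right triangle with vertices A, B, C.
code = triangle_tikz([(0, 0), (4, 0), (0, 3)], ["A", "B", "C"])
print(code)
```

An image-to-code model like FigCodifier inverts this mapping, recovering a program of this kind from the rendered figure; the (figure, code) pairs then serve as the alignment training data.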