MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning

Abstract

Natural language image-caption datasets, widely used for training Large Multimodal Models (LMMs), mainly focus on natural scenarios and overlook the intricate details of mathematical figures that are critical for problem-solving, which hinders the progress of current LMMs in multimodal mathematical reasoning. To address this, we propose leveraging code as supervision for cross-modal alignment: code inherently encodes all the information needed to generate the corresponding figure, so it establishes a precise connection between the two modalities. Specifically, we co-develop our image-to-code model and dataset with a model-in-the-loop approach, resulting in an image-to-code model, FigCodifier, and the ImgCode-8.6M dataset, the largest image-code dataset to date. Furthermore, we use FigCodifier to synthesize novel mathematical figures and then construct MM-MathInstruct-3M, a high-quality multimodal math instruction fine-tuning dataset. Finally, we present MathCoder-VL, trained on ImgCode-8.6M for cross-modal alignment and subsequently fine-tuned on MM-MathInstruct-3M for multimodal math problem solving. Our model achieves a new open-source state of the art (SOTA) across all six metrics. Notably, it surpasses GPT-4o and Claude 3.5 Sonnet on the geometry problem-solving subset of MathVista by 8.9% and 9.2%, respectively. The dataset and models will be released at https://github.com/mathllm/MathCoder.
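The claim that code encodes everything needed to generate a figure can be illustrated with a small sketch. The example below is not taken from ImgCode-8.6M or FigCodifier; the abstract does not say which code language the dataset uses, so SVG output, the `VERTICES` dictionary, and the helper names `render_svg` and `side_length` are all assumptions made for illustration. It shows that the same short program both draws a triangle and exposes its exact geometry, which a free-form caption would typically leave out.

```python
import math

# Hypothetical figure description: a right triangle given by exact coordinates.
# Assumption: SVG is used here only for illustration; the abstract does not
# state the code format of the actual image-code pairs.
VERTICES = {"A": (0.0, 0.0), "B": (4.0, 0.0), "C": (0.0, 3.0)}


def render_svg(vertices, scale=40, margin=20):
    """Return an SVG string that draws the polygon through the given vertices."""
    width = max(x for x, _ in vertices.values()) * scale + 2 * margin
    height = max(y for _, y in vertices.values()) * scale + 2 * margin
    # Flip the y-axis so that larger y values appear higher on screen.
    points = " ".join(
        f"{x * scale + margin:g},{height - margin - y * scale:g}"
        for x, y in vertices.values()
    )
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{width:g}" height="{height:g}">'
        f'<polygon points="{points}" fill="none" stroke="black"/></svg>'
    )


def side_length(vertices, p, q):
    """Euclidean distance between two named vertices."""
    (x1, y1), (x2, y2) = vertices[p], vertices[q]
    return math.hypot(x2 - x1, y2 - y1)


svg = render_svg(VERTICES)                    # the "image" side of the pair
hypotenuse = side_length(VERTICES, "B", "C")  # geometry recoverable from the code
```

In this sketch, the rendered image and the source code form one image-code pair: a model trained to map the image back to the code has to recover the vertex positions precisely, not just recognize that the picture contains a triangle.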
