ChatPaper.ai


MAVIS: Mathematical Visual Instruction Tuning

July 11, 2024
Authors: Renrui Zhang, Xinyu Wei, Dongzhi Jiang, Yichi Zhang, Ziyu Guo, Chengzhuo Tong, Jiaming Liu, Aojun Zhou, Bin Wei, Shanghang Zhang, Peng Gao, Hongsheng Li
cs.AI

Abstract

Multi-modal Large Language Models (MLLMs) have recently emerged as a significant focus in academia and industry. Despite their proficiency in general multi-modal scenarios, their mathematical problem-solving capabilities in visual contexts remain insufficiently explored. We identify three key areas within MLLMs that need to be improved: visual encoding of math diagrams, diagram-language alignment, and mathematical reasoning skills. This creates an urgent demand for large-scale, high-quality data and training pipelines in visual mathematics. In this paper, we propose MAVIS, the first MAthematical VISual instruction tuning paradigm for MLLMs, involving a series of mathematical visual datasets and specialized MLLMs. Targeting the three issues, MAVIS contains three progressive training stages from scratch. First, we curate MAVIS-Caption, consisting of 558K diagram-caption pairs, to fine-tune a math-specific vision encoder (CLIP-Math) through contrastive learning, tailored for improved diagram visual encoding. Second, we utilize MAVIS-Caption to align CLIP-Math with a large language model (LLM) via a projection layer, enhancing vision-language alignment in mathematical domains. Third, we introduce MAVIS-Instruct, comprising 900K meticulously collected and annotated visual math problems, which is used to instruct-tune the MLLM for robust mathematical reasoning skills. In MAVIS-Instruct, we incorporate complete chain-of-thought (CoT) rationales for each problem and minimize textual redundancy, thereby focusing the model on the visual elements. Data and models are released at https://github.com/ZrrSkywalker/MAVIS
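The first training stage described above fine-tunes CLIP-Math on diagram-caption pairs via contrastive learning. A CLIP-style contrastive objective treats matched diagram/caption embeddings as positives and all other pairings in the batch as negatives, using a symmetric InfoNCE loss. The function below is a minimal, framework-free sketch of that objective (plain Python, illustrative names only); it is not the paper's implementation, which trains on the full 558K-pair MAVIS-Caption dataset.

```python
import math

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched
    diagram/caption embedding pairs (CLIP-style objective).
    Row i of image_embs is assumed to match row i of text_embs."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def normalize(v):
        n = math.sqrt(dot(v, v))
        return [x / n for x in v]

    # L2-normalize so the dot product is a cosine similarity.
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)

    # Similarity logits between every image and every caption.
    logits = [[dot(i, t) / temperature for t in txts] for i in imgs]

    def cross_entropy(row, target):
        # Numerically stable log-sum-exp.
        m = max(row)
        lse = m + math.log(sum(math.exp(x - m) for x in row))
        return lse - row[target]

    # Image-to-text direction: each image should pick its own caption.
    loss_i2t = sum(cross_entropy(logits[k], k) for k in range(n)) / n
    # Text-to-image direction: each caption should pick its own image.
    loss_t2i = sum(cross_entropy([logits[r][k] for r in range(n)], k)
                   for k in range(n)) / n
    return (loss_i2t + loss_t2i) / 2
```

With matched pairs the loss approaches zero, while mismatched pairs are penalized, which is what pushes the vision encoder toward math-diagram-aware embeddings.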
PDF · November 28, 2024