MAVIS: Mathematical Visual Instruction Tuning
July 11, 2024
Authors: Renrui Zhang, Xinyu Wei, Dongzhi Jiang, Yichi Zhang, Ziyu Guo, Chengzhuo Tong, Jiaming Liu, Aojun Zhou, Bin Wei, Shanghang Zhang, Peng Gao, Hongsheng Li
cs.AI
Abstract
Multi-modal Large Language Models (MLLMs) have recently emerged as a
significant focus in academia and industry. Despite their proficiency in
general multi-modal scenarios, the mathematical problem-solving capabilities in
visual contexts remain insufficiently explored. We identify three key areas
within MLLMs that need to be improved: visual encoding of math diagrams,
diagram-language alignment, and mathematical reasoning skills. This draws forth
an urgent demand for large-scale, high-quality data and training pipelines in
visual mathematics. In this paper, we propose MAVIS, the first MAthematical
VISual instruction tuning paradigm for MLLMs, involving a series of
mathematical visual datasets and specialized MLLMs. Targeting the three issues,
MAVIS contains three progressive training stages from scratch. First, we curate
MAVIS-Caption, consisting of 558K diagram-caption pairs, to fine-tune a
math-specific vision encoder (CLIP-Math) through contrastive learning, tailored
for improved diagram visual encoding. Second, we utilize MAVIS-Caption to align
the CLIP-Math with a large language model (LLM) by a projection layer,
enhancing vision-language alignment in mathematical domains. Third, we
introduce MAVIS-Instruct, including 900K meticulously collected and annotated
visual math problems, which is adopted to finally instruct-tune the MLLM for
robust mathematical reasoning skills. In MAVIS-Instruct, we incorporate
complete chain-of-thought (CoT) rationales for each problem, and minimize
textual redundancy, thereby concentrating the model towards the visual
elements. Data and Models are released at https://github.com/ZrrSkywalker/MAVISSummary