MAVIS: Mathematical Visual Instruction Tuning

Abstract

Multi-modal Large Language Models (MLLMs) have recently become a major focus in academia and industry. Despite their proficiency in general multi-modal scenarios, their mathematical problem-solving capabilities in visual contexts remain insufficiently explored. We identify three key areas within MLLMs that need improvement: visual encoding of math diagrams, diagram-language alignment, and mathematical reasoning skills. These gaps create an urgent demand for large-scale, high-quality data and training pipelines in visual mathematics. In this paper, we propose MAVIS, the first MAthematical VISual instruction tuning paradigm for MLLMs, comprising a series of mathematical visual datasets and specialized MLLMs. Targeting the three issues, MAVIS contains three progressive training stages from scratch. First, we curate MAVIS-Caption, consisting of 558K diagram-caption pairs, to fine-tune a math-specific vision encoder (CLIP-Math) through contrastive learning, tailored for improved visual encoding of diagrams. Second, we use MAVIS-Caption to align CLIP-Math with a large language model (LLM) through a projection layer, enhancing vision-language alignment in mathematical domains. Third, we introduce MAVIS-Instruct, which includes 900K meticulously collected and annotated visual math problems and is used to instruction-tune the MLLM for robust mathematical reasoning skills. In MAVIS-Instruct, we incorporate complete chain-of-thought (CoT) rationales for each problem and minimize textual redundancy, thereby focusing the model on the visual elements. Data and models are released at https://github.com/ZrrSkywalker/MAVIS
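
The first stage fine-tunes the vision encoder on diagram-caption pairs through contrastive learning. The abstract does not specify the exact objective, so the following is only a minimal sketch, assuming the standard CLIP-style symmetric contrastive (InfoNCE) loss. The function names, the fixed temperature of 0.07, and the use of plain Python lists in place of real encoder outputs are all illustrative assumptions, not details taken from MAVIS.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    Assumption: pair k is (image_embs[k], text_embs[k]); every other
    combination in the batch is treated as a negative.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)

    # logits[i][j] = cosine similarity of image i and caption j, scaled by temperature
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]

    # Image-to-text direction: each row should put its mass on the diagonal entry.
    loss_i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n

    # Text-to-image direction: same computation on the transposed matrix.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    loss_t2i = -sum(log_softmax(columns[k])[k] for k in range(n)) / n

    return 0.5 * (loss_i2t + loss_t2i)


# Toy embeddings standing in for encoder outputs (hypothetical values).
aligned = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

In this toy batch the correctly paired embeddings yield a much smaller loss than the swapped ones, which is the signal that would pull a diagram and its caption together in embedding space during fine-tuning.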
