MAVIS: Mathematical Visual Instruction Tuning

Abstract

Multi-modal Large Language Models (MLLMs) have recently emerged as a significant focus in academia and industry. Despite their proficiency in general multi-modal scenarios, their mathematical problem-solving capabilities in visual contexts remain insufficiently explored. We identify three key areas within MLLMs that need improvement: visual encoding of math diagrams, diagram-language alignment, and mathematical reasoning skills. This creates an urgent demand for large-scale, high-quality data and training pipelines in visual mathematics. In this paper, we propose MAVIS, the first MAthematical VISual instruction tuning paradigm for MLLMs, involving a series of mathematical visual datasets and specialized MLLMs. Targeting the three issues, MAVIS contains three progressive training stages from scratch. First, we curate MAVIS-Caption, consisting of 558K diagram-caption pairs, to fine-tune a math-specific vision encoder (CLIP-Math) through contrastive learning, tailored for improved diagram visual encoding. Second, we use MAVIS-Caption to align CLIP-Math with a large language model (LLM) through a projection layer, enhancing vision-language alignment in mathematical domains. Third, we introduce MAVIS-Instruct, which includes 900K meticulously collected and annotated visual math problems and is used to instruct-tune the MLLM for robust mathematical reasoning skills. In MAVIS-Instruct, we incorporate complete chain-of-thought (CoT) rationales for each problem and minimize textual redundancy, thereby focusing the model on the visual elements. Data and models are released at https://github.com/ZrrSkywalker/MAVIS.
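To make the first stage more concrete, the following is a minimal sketch of a CLIP-style symmetric contrastive loss over a batch of diagram-caption pairs. The abstract states only that CLIP-Math is fine-tuned "through contrastive learning"; the symmetric cross-entropy form, the function name `clip_contrastive_loss`, and the temperature of 0.07 are assumptions borrowed from the original CLIP recipe, not details confirmed for MAVIS. The example uses plain Python lists so that it runs without any dependencies.

```python
import math


def _normalize(vec):
    """Scale a vector to unit L2 norm (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _cross_entropy(logits, target):
    """Numerically stable cross-entropy of one row of logits against a target index."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair i of images matches pair i of texts.

    The temperature value is an assumption taken from CLIP, not from MAVIS.
    """
    images = [_normalize(v) for v in image_embs]
    texts = [_normalize(v) for v in text_embs]
    n = len(images)

    # sim[i][j] is the temperature-scaled cosine similarity of image i and text j.
    sim = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]

    # Image-to-text: each row should peak on its diagonal entry.
    loss_i2t = sum(_cross_entropy(sim[i], i) for i in range(n)) / n
    # Text-to-image: each column should peak on its diagonal entry.
    loss_t2i = sum(
        _cross_entropy([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (loss_i2t + loss_t2i)


# Hypothetical toy embeddings: correctly paired vs. swapped captions.
matched = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

With these toy inputs, the correctly paired batch yields a much lower loss than the swapped one, which is the signal that pulls each diagram embedding toward its own caption during fine-tuning.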
