MAVIS: Mathematical Visual Instruction Tuning

Abstract

Multi-modal Large Language Models (MLLMs) have recently become a major focus in academia and industry. Although they perform well in general multi-modal scenarios, their ability to solve mathematical problems in visual contexts remains insufficiently explored. We identify three key areas of MLLMs that need improvement: visual encoding of math diagrams, diagram-language alignment, and mathematical reasoning skills. This creates an urgent demand for large-scale, high-quality data and training pipelines in visual mathematics. In this paper, we propose MAVIS, the first MAthematical VISual instruction tuning paradigm for MLLMs, comprising a series of mathematical visual datasets and specialized MLLMs. Targeting the three issues, MAVIS consists of three progressive training stages from scratch. First, we curate MAVIS-Caption, consisting of 558K diagram-caption pairs, to fine-tune a math-specific vision encoder (CLIP-Math) through contrastive learning for improved visual encoding of diagrams. Second, we use MAVIS-Caption to align CLIP-Math with a large language model (LLM) through a projection layer, enhancing vision-language alignment in mathematical domains. Third, we introduce MAVIS-Instruct, comprising 900K meticulously collected and annotated visual math problems, which is used to instruction-tune the MLLM for robust mathematical reasoning skills. In MAVIS-Instruct, we include complete chain-of-thought (CoT) rationales for each problem and minimize textual redundancy, thereby directing the model's attention toward the visual elements. Data and models are released at https://github.com/ZrrSkywalker/MAVIS
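
The first stage fine-tunes the vision encoder on diagram-caption pairs through contrastive learning. The abstract does not state the exact objective, so the following is only a minimal sketch, assuming the symmetric InfoNCE loss used by the original CLIP. The function names, the temperature of 0.07, and the toy embeddings are illustrative assumptions, not details taken from MAVIS.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    Assumption: row k of image_embs is the diagram matching row k of
    text_embs (its caption); all other rows in the batch act as negatives.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)

    # logits[i][j] = cosine similarity of diagram i and caption j, scaled.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]

    # Diagram-to-caption direction: each row should pick its own caption.
    loss_i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n

    # Caption-to-diagram direction: same computation on the transposed matrix.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    loss_t2i = -sum(log_softmax(columns[k])[k] for k in range(n)) / n

    return 0.5 * (loss_i2t + loss_t2i)


# Toy batch: two diagrams and two captions embedded in 2-D (hypothetical values).
aligned = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

In this toy batch the loss is near zero when each diagram embedding matches its own caption and large when the captions are swapped, which is the signal that pulls matching pairs together during fine-tuning.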

