MAVIS: Mathematical Visual Instruction Tuning

Abstract

Multi-modal Large Language Models (MLLMs) have recently emerged as a significant focus in academia and industry. Despite their proficiency in general multi-modal scenarios, their mathematical problem-solving capabilities in visual contexts remain insufficiently explored. We identify three key areas within MLLMs that need improvement: visual encoding of math diagrams, diagram-language alignment, and mathematical reasoning skills. This creates an urgent need for large-scale, high-quality data and training pipelines in visual mathematics. In this paper, we propose MAVIS, the first MAthematical VISual instruction tuning paradigm for MLLMs, comprising a series of mathematical visual datasets and specialized MLLMs. Targeting the three issues, MAVIS contains three progressive training stages from scratch. First, we curate MAVIS-Caption, consisting of 558K diagram-caption pairs, and use it to fine-tune a math-specific vision encoder (CLIP-Math) through contrastive learning, improving the visual encoding of diagrams. Second, we use MAVIS-Caption to align CLIP-Math with a large language model (LLM) through a projection layer, enhancing vision-language alignment in mathematical domains. Third, we introduce MAVIS-Instruct, which contains 900K meticulously collected and annotated visual math problems, and use it to instruction-tune the MLLM for robust mathematical reasoning skills. In MAVIS-Instruct, we incorporate complete chain-of-thought (CoT) rationales for each problem and minimize textual redundancy, thereby focusing the model on the visual elements. Data and models are released at https://github.com/ZrrSkywalker/MAVIS.
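
The abstract states only that CLIP-Math is fine-tuned "through contrastive learning" on diagram-caption pairs and does not give the loss. As a rough illustration, the sketch below assumes the symmetric InfoNCE objective commonly used for CLIP-style training. The function names, the temperature value of 0.07, and the toy embeddings are assumptions made for this example and are not taken from the MAVIS release.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _cross_entropy(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]


def clip_contrastive_loss(diagram_embs, caption_embs, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched (diagram, caption) pairs.

    Row i of diagram_embs is assumed to be paired with row i of caption_embs.
    The temperature default is an assumed value, not one from the paper.
    """
    diagrams = [_normalize(v) for v in diagram_embs]
    captions = [_normalize(v) for v in caption_embs]
    n = len(diagrams)
    # sim[i][j]: temperature-scaled cosine similarity of diagram i and caption j.
    sim = [
        [sum(a * b for a, b in zip(diagrams[i], captions[j])) / temperature
         for j in range(n)]
        for i in range(n)
    ]
    # Diagram-to-caption direction: each row should pick out its own caption.
    d2c = sum(_cross_entropy(sim[i], i) for i in range(n)) / n
    # Caption-to-diagram direction: each column should pick out its own diagram.
    c2d = sum(_cross_entropy([sim[i][j] for i in range(n)], j) for j in range(n)) / n
    return 0.5 * (d2c + c2d)


# Toy batch: correctly paired embeddings give a much smaller loss than mismatched ones.
diagrams = [[1.0, 0.0], [0.0, 1.0]]
aligned = clip_contrastive_loss(diagrams, [[0.9, 0.1], [0.1, 0.9]])
shuffled = clip_contrastive_loss(diagrams, [[0.1, 0.9], [0.9, 0.1]])
print(aligned < shuffled)
```

Minimizing a loss of this kind pulls each diagram embedding toward its own caption and pushes it away from the other captions in the batch, which is the general mechanism a contrastively tuned vision encoder relies on.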
