Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models

Abstract

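Large language models (LLMs) have demonstrated impressive reasoning capabilities, particularly in text-based mathematical problem solving. However, existing open-source image instruction fine-tuning datasets contain only a limited number of question-answer pairs per image, and so do not fully exploit visual information to strengthen the diverse mathematical reasoning capabilities of multimodal LLMs (MLLMs). To bridge this gap, we address the lack of high-quality, diverse multimodal mathematical datasets by collecting 40K high-quality images with question-answer pairs from 24 existing datasets and synthesizing 320K new pairs, creating the MathV360K dataset, which enhances both the breadth and depth of multimodal mathematical questions. We introduce Math-LLaVA, a LLaVA-1.5-based model fine-tuned on MathV360K. This approach significantly improves the multimodal mathematical reasoning capabilities of LLaVA-1.5, achieving a 19-point increase on the minitest split of MathVista and performance comparable to GPT-4V. Furthermore, Math-LLaVA shows enhanced generalizability, with substantial improvements on the MMMU benchmark. Our research highlights the importance of dataset diversity and synthesis in advancing the mathematical reasoning abilities of MLLMs. The code and data are available at https://github.com/HZQ950419/Math-LLaVA.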

