Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models
June 25, 2024
Authors: Wenhao Shi, Zhiqiang Hu, Yi Bin, Junhua Liu, Yang Yang, See-Kiong Ng, Lidong Bing, Roy Ka-Wei Lee
cs.AI
Abstract
Large language models (LLMs) have demonstrated impressive reasoning
capabilities, particularly in textual mathematical problem-solving. However,
existing open-source image instruction fine-tuning datasets, containing limited
question-answer pairs per image, do not fully exploit visual information to
enhance the multimodal mathematical reasoning capabilities of Multimodal LLMs
(MLLMs). To bridge this gap, we address the lack of high-quality, diverse
multimodal mathematical datasets by collecting 40K high-quality images with
question-answer pairs from 24 existing datasets and synthesizing 320K new
pairs, creating the MathV360K dataset, which enhances both the breadth and
depth of multimodal mathematical questions. We introduce Math-LLaVA, a
LLaVA-1.5-based model fine-tuned with MathV360K. This novel approach
significantly improves the multimodal mathematical reasoning capabilities of
LLaVA-1.5, achieving a 19-point increase and comparable performance to GPT-4V
on MathVista's minitest split. Furthermore, Math-LLaVA demonstrates enhanced
generalizability, showing substantial improvements on the MMMU benchmark. Our
research highlights the importance of dataset diversity and synthesis in
advancing MLLMs' mathematical reasoning abilities. The code and data are
available at: https://github.com/HZQ950419/Math-LLaVA.