MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
March 21, 2024
Authors: Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Peng Gao, Hongsheng Li
cs.AI
Abstract
The remarkable progress of Multi-modal Large Language Models (MLLMs) has
garnered unparalleled attention, due to their superior performance in visual
contexts. However, their capabilities in visual math problem-solving remain
insufficiently evaluated and understood. We observe that current benchmarks
incorporate excessive visual content within textual questions, which
potentially assists MLLMs in deducing answers without truly interpreting the
input diagrams. To this end, we introduce MathVerse, an all-around visual math
benchmark designed for an equitable and in-depth evaluation of MLLMs. We
meticulously collect 2,612 high-quality, multi-subject math problems with
diagrams from publicly available sources. Each problem is then transformed by
human annotators into six distinct versions, each offering varying degrees of
information content in multi-modality, contributing to 15K test samples in
total. This approach allows MathVerse to comprehensively assess whether and how
much MLLMs can truly understand the visual diagrams for mathematical reasoning.
In addition, we propose a Chain-of-Thought (CoT) evaluation strategy for a
fine-grained assessment of the output answers. Rather than naively judging True
or False, we employ GPT-4(V) to adaptively extract crucial reasoning steps, and
then score each step with detailed error analysis, which can reveal the
intermediate CoT reasoning quality by MLLMs. We hope the MathVerse benchmark
may provide unique insights to guide the future development of MLLMs. Project
page: https://mathverse-cuhk.github.io
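The two-stage CoT evaluation described in the abstract (key-step extraction, then per-step error analysis) implies a scoring rule that combines intermediate reasoning quality with final-answer correctness. The sketch below illustrates one plausible aggregation; the step format, the `cot_score` helper, and the 50/50 weighting between step accuracy and the final answer are assumptions for illustration, not MathVerse's published implementation (in practice the extraction and judging would be done by GPT-4(V)).

```python
from dataclasses import dataclass


@dataclass
class StepScore:
    description: str  # a key reasoning step extracted from the model's answer
    correct: bool     # whether the step was judged correct in error analysis


def cot_score(steps: list[StepScore], final_answer_correct: bool,
              step_weight: float = 0.5) -> float:
    """Aggregate per-step judgments and final-answer correctness.

    The equal weighting of intermediate reasoning and the final answer
    is an illustrative assumption, not MathVerse's exact rule.
    """
    if not steps:
        # No extractable reasoning: fall back to answer correctness alone.
        return 1.0 if final_answer_correct else 0.0
    step_acc = sum(s.correct for s in steps) / len(steps)
    return step_weight * step_acc + (1 - step_weight) * float(final_answer_correct)


# Example: 2 of 3 extracted steps judged correct, final answer wrong.
steps = [
    StepScore("Identify the triangle as isosceles from the diagram", True),
    StepScore("Set up the angle-sum equation", True),
    StepScore("Solve for the unknown angle", False),
]
print(round(cot_score(steps, final_answer_correct=False), 3))  # → 0.333
```

A scheme like this rewards partially correct reasoning chains that naive True/False grading would score as zero, which is the fine-grained behavior the benchmark's CoT evaluation aims for.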