MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
March 21, 2024
Authors: Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Peng Gao, Hongsheng Li
cs.AI
Abstract
The remarkable progress of Multi-modal Large Language Models (MLLMs) has
garnered unparalleled attention, due to their superior performance in visual
contexts. However, their capabilities in visual math problem-solving remain
insufficiently evaluated and understood. We find that current benchmarks
incorporate excessive visual content within textual questions, which
potentially assists MLLMs in deducing answers without truly interpreting the
input diagrams. To this end, we introduce MathVerse, an all-around visual math
benchmark designed for an equitable and in-depth evaluation of MLLMs. We
meticulously collect 2,612 high-quality, multi-subject math problems with
diagrams from publicly available sources. Each problem is then transformed by
human annotators into six distinct versions, each offering varying degrees of
information content in multi-modality, contributing to 15K test samples in
total. This approach allows MathVerse to comprehensively assess whether and how
much MLLMs can truly understand the visual diagrams for mathematical reasoning.
In addition, we propose a Chain-of-Thought (CoT) evaluation strategy for a
fine-grained assessment of the output answers. Rather than naively judging True
or False, we employ GPT-4(V) to adaptively extract crucial reasoning steps, and
then score each step with detailed error analysis, which can reveal the
intermediate CoT reasoning quality of MLLMs. We hope the MathVerse benchmark
may provide unique insights to guide the future development of MLLMs. Project
page: https://mathverse-cuhk.github.io
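To make the data construction concrete: each annotated problem expands into six test samples, one per modality-ablation version. Below is a minimal sketch, assuming the version names listed on the project page (Text Dominant, Text Lite, Text Only, Vision Intensive, Vision Dominant, Vision Only); the record fields and the one-line glosses are illustrative assumptions, not the authors' actual schema.

```python
# Hypothetical sketch of how one annotated problem expands into six test
# samples. Version names follow the MathVerse project page; the field layout
# and the glosses are assumptions, not the paper's schema.
from dataclasses import dataclass
from enum import Enum

class Version(Enum):
    TEXT_DOMINANT = "Text Dominant"        # full text plus diagram
    TEXT_LITE = "Text Lite"                # descriptive text reduced
    TEXT_ONLY = "Text Only"                # diagram removed entirely
    VISION_INTENSIVE = "Vision Intensive"  # more conditions carried by the diagram
    VISION_DOMINANT = "Vision Dominant"    # most conditions carried by the diagram
    VISION_ONLY = "Vision Only"            # question rendered inside the image

@dataclass
class Sample:
    problem_id: int
    version: Version
    question_text: str         # may be empty for Vision Only
    diagram_path: str | None   # None for Text Only

def expand(problem_id: int,
           variants: dict[Version, tuple[str, str | None]]) -> list[Sample]:
    """Turn one annotated problem into its six MathVerse test samples."""
    assert set(variants) == set(Version), "annotators supply all six versions"
    return [Sample(problem_id, v, text, img) for v, (text, img) in variants.items()]
```

Note that 2,612 problems × 6 versions = 15,672 samples, consistent with the "15K test samples in total" figure above.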
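The CoT evaluation strategy is likewise a two-stage judge pipeline: first extract the key reasoning steps from a model's answer, then score each step with an error analysis. The sketch below assumes the OpenAI chat-completions API; the prompts, the `gpt-4o` stand-in for GPT-4(V), and the simple averaging rule are illustrative assumptions, not the paper's actual protocol.

```python
# Hypothetical sketch of the step-wise CoT evaluation described in the
# abstract. Prompts, model name, and scoring rubric are illustrative, not
# the authors' implementation.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
MODEL = "gpt-4o"   # stand-in for the GPT-4(V) judge used in the paper

EXTRACT_PROMPT = (
    "Split the following answer to a math problem into its key reasoning "
    "steps, one per line:\n\n{answer}"
)
SCORE_PROMPT = (
    "Problem: {question}\nGround-truth answer: {gt}\n"
    "Reasoning step: {step}\n"
    "Is this step correct? Reply with 1 (correct) or 0 (incorrect), "
    "followed by a one-sentence error analysis."
)

def judge(prompt: str) -> str:
    """Single chat-completion call to the judge model."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

def evaluate_cot(question: str, gt: str, answer: str) -> float:
    """Extract key reasoning steps, score each one, and return the mean score."""
    steps = [s for s in judge(EXTRACT_PROMPT.format(answer=answer)).splitlines()
             if s.strip()]
    scores = [1.0 if judge(SCORE_PROMPT.format(question=question, gt=gt,
                                               step=step)).lstrip().startswith("1")
              else 0.0
              for step in steps]
    return sum(scores) / len(scores) if scores else 0.0
```

Scoring each step separately, rather than only checking the final answer, is what lets the benchmark distinguish a lucky guess from a sound derivation.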