MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation
June 29, 2024
Authors: Jinsheng Huang, Liang Chen, Taian Guo, Fu Zeng, Yusheng Zhao, Bohan Wu, Ye Yuan, Haozhe Zhao, Zhihui Guo, Yichi Zhang, Jingyang Yuan, Wei Ju, Luchen Liu, Tianyu Liu, Baobao Chang, Ming Zhang
cs.AI
Abstract
Large Multimodal Models (LMMs) exhibit impressive cross-modal understanding
and reasoning abilities, often assessed through multiple-choice questions
(MCQs) that include an image, a question, and several options. However, many
benchmarks used for such evaluations suffer from systematic biases. Remarkably,
Large Language Models (LLMs) without any visual perception capabilities achieve
non-trivial performance, undermining the credibility of these evaluations. To
address this issue while maintaining the efficiency of MCQ evaluations, we
propose MMEvalPro, a benchmark designed to avoid Type-I errors through a
trilogy evaluation pipeline and more rigorous metrics. For each original
question from existing benchmarks, human annotators augment it by creating one
perception question and one knowledge anchor question through a meticulous
annotation process. MMEvalPro comprises 2,138 question triplets, totaling
6,414 distinct questions. Two-thirds of these questions are manually labeled
by human experts, while the rest are sourced from existing benchmarks (MMMU,
ScienceQA, and MathVista). Compared with the existing benchmarks, our
experiments with the latest LLMs and LMMs demonstrate that MMEvalPro is more
challenging (the best LMM lags behind human performance by 31.73%, compared
to an average gap of 8.03% in previous benchmarks) and more trustworthy (the
best LLM trails the best LMM by 23.09%, whereas the gap for previous
benchmarks is just 14.64%). Our in-depth analysis explains the reason for
the large performance gap and justifies the trustworthiness of evaluation,
underscoring its significant potential for advancing future research.
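For illustration, below is a minimal sketch of the kind of stricter, triplet-level scoring the abstract alludes to: a question triplet counts as correct only when the original question, the perception question, and the knowledge anchor question are all answered correctly. The field names and function names here are assumptions made for this sketch, not the paper's actual schema or metric definitions.

```python
# Hypothetical sketch of triplet-level scoring for MMEvalPro-style data.
# Field names ("origin", "perception", "knowledge") are illustrative assumptions.
from typing import Dict, List


def triplet_accuracy(results: List[Dict[str, bool]]) -> float:
    """Fraction of triplets in which every one of the three questions is correct."""
    if not results:
        return 0.0
    all_correct = sum(
        1 for r in results
        if r["origin"] and r["perception"] and r["knowledge"]
    )
    return all_correct / len(results)


def question_accuracy(results: List[Dict[str, bool]]) -> float:
    """Ordinary per-question accuracy over all 3 * len(results) questions."""
    if not results:
        return 0.0
    total = sum(r["origin"] + r["perception"] + r["knowledge"] for r in results)
    return total / (3 * len(results))


if __name__ == "__main__":
    demo = [
        {"origin": True, "perception": True, "knowledge": True},
        {"origin": True, "perception": False, "knowledge": True},
    ]
    print(triplet_accuracy(demo))   # 0.5   -- stricter, triplet-level score
    print(question_accuracy(demo))  # 0.833 -- looser, per-question score
```

The contrast between the two functions shows why such a metric is harder to game: a text-only model that guesses the original MCQ correctly still fails the triplet unless it also answers the perception and knowledge questions that depend on actually seeing the image.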