MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation
June 29, 2024
Authors: Jinsheng Huang, Liang Chen, Taian Guo, Fu Zeng, Yusheng Zhao, Bohan Wu, Ye Yuan, Haozhe Zhao, Zhihui Guo, Yichi Zhang, Jingyang Yuan, Wei Ju, Luchen Liu, Tianyu Liu, Baobao Chang, Ming Zhang
cs.AI
Abstract
Large Multimodal Models (LMMs) exhibit impressive cross-modal understanding
and reasoning abilities, often assessed through multiple-choice questions
(MCQs) that include an image, a question, and several options. However, many
benchmarks used for such evaluations suffer from systematic biases. Remarkably,
Large Language Models (LLMs) without any visual perception capabilities achieve
non-trivial performance, undermining the credibility of these evaluations. To
address this issue while maintaining the efficiency of MCQ evaluations, we
propose MMEvalPro, a benchmark designed to avoid Type-I errors through a
trilogy evaluation pipeline and more rigorous metrics. For each original
question from existing benchmarks, human annotators augment it by creating one
perception question and one knowledge anchor question through a meticulous
annotation process. MMEvalPro comprises 2,138 question triplets, totaling
6,414 distinct questions. Two-thirds of these questions are manually labeled
by human experts, while the rest are sourced from existing benchmarks (MMMU,
ScienceQA, and MathVista). Compared with the existing benchmarks, our
experiments with the latest LLMs and LMMs demonstrate that MMEvalPro is more
challenging (the best LMM lags behind human performance by 31.73%, compared
to an average gap of 8.03% in previous benchmarks) and more trustworthy (the
best LLM trails the best LMM by 23.09%, whereas the gap for previous
benchmarks is just 14.64%). Our in-depth analysis explains the reason for
the large performance gap and justifies the trustworthiness of the
evaluation, underscoring its significant potential for advancing future research.
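The abstract's "trilogy evaluation pipeline and more rigorous metrics" implies scoring at the level of question triplets rather than isolated MCQs. The Python sketch below illustrates one plausible strict, triplet-level metric, assuming a model is credited for a triplet only when the original, perception, and knowledge-anchor questions are all answered correctly; the `QuestionTriplet` fields and the `triplet_accuracy` helper are illustrative assumptions, not the benchmark's released evaluation code.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class QuestionTriplet:
    """One benchmark item: the original MCQ plus its two companion probes.

    Field names are hypothetical; they mirror the triplet structure
    described in the abstract (original, perception, knowledge anchor).
    """
    origin_id: str
    origin_answer: str      # gold option for the original question
    perception_answer: str  # gold option for the perception question
    knowledge_answer: str   # gold option for the knowledge-anchor question


def triplet_accuracy(
    triplets: List[QuestionTriplet],
    predictions: Dict[str, Dict[str, str]],
) -> float:
    """Strict triplet-level accuracy (assumed metric): a triplet counts
    only if all three of its questions are answered correctly."""
    if not triplets:
        return 0.0
    correct = 0
    for t in triplets:
        pred = predictions.get(t.origin_id, {})
        if (
            pred.get("origin") == t.origin_answer
            and pred.get("perception") == t.perception_answer
            and pred.get("knowledge") == t.knowledge_answer
        ):
            correct += 1
    return correct / len(triplets)


# Minimal usage example with made-up data:
if __name__ == "__main__":
    items = [QuestionTriplet("q1", "A", "C", "B")]
    preds = {"q1": {"origin": "A", "perception": "C", "knowledge": "B"}}
    print(triplet_accuracy(items, preds))  # 1.0
```

Under such a metric, a text-only model that guesses the original MCQ correctly but fails the perception probe receives no credit, which is consistent with the abstract's claim that the benchmark widens the gap between LLMs and genuinely multimodal LMMs.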