MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding
June 13, 2024
Authors: Fei Wang, Xingyu Fu, James Y. Huang, Zekun Li, Qin Liu, Xiaogeng Liu, Mingyu Derek Ma, Nan Xu, Wenxuan Zhou, Kai Zhang, Tianyi Lorena Yan, Wenjie Jacky Mo, Hsiang-Hui Liu, Pan Lu, Chunyuan Li, Chaowei Xiao, Kai-Wei Chang, Dan Roth, Sheng Zhang, Hoifung Poon, Muhao Chen
cs.AI
Abstract
We introduce MuirBench, a comprehensive benchmark that focuses on robust
multi-image understanding capabilities of multimodal LLMs. MuirBench consists
of 12 diverse multi-image tasks (e.g., scene understanding, ordering) that
involve 10 categories of multi-image relations (e.g., multiview, temporal
relations). Comprising 11,264 images and 2,600 multiple-choice questions,
MuirBench is created in a pairwise manner, where each standard instance is
paired with an unanswerable variant that has minimal semantic differences, to
enable reliable assessment. Evaluating 20 recent multimodal LLMs, we find that
even the best-performing models, such as GPT-4o and Gemini Pro, struggle to
solve MuirBench, achieving only 68.0% and 49.3% accuracy, respectively.
Open-source multimodal LLMs trained on single images generalize poorly to
multi-image questions, with accuracy below 33.3%. These results
highlight the importance of MuirBench in encouraging the community to develop
multimodal LLMs that can look beyond a single image, suggesting potential
pathways for future improvements.
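The pairwise design described in the abstract, in which each answerable instance is matched with a minimally edited unanswerable variant, lends itself to a stricter scoring rule than plain per-question accuracy. The sketch below is a hypothetical illustration of how such paired scoring could be computed; the dataclass fields and the `paired_accuracy` metric are assumptions made for illustration, not the released MuirBench schema or the paper's official evaluation code.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical schema for a MuirBench-style multiple-choice instance.
# Field names are illustrative; the released dataset may use a different format.
@dataclass
class Instance:
    uid: str                 # unique identifier for the instance
    question: str
    image_paths: List[str]   # several images accompany each question
    options: List[str]       # e.g. ["A. ...", "B. ...", "C. ...", "D. None of the above"]
    answer: str              # gold option label, e.g. "B"

@dataclass
class PairedInstance:
    standard: Instance       # answerable version
    variant: Instance        # minimally edited, unanswerable counterpart

def accuracy(predictions: Dict[str, str], instances: List[Instance]) -> float:
    """Plain accuracy: fraction of instances answered with the gold option."""
    correct = sum(predictions[inst.uid] == inst.answer for inst in instances)
    return correct / len(instances)

def paired_accuracy(predictions: Dict[str, str], pairs: List[PairedInstance]) -> float:
    """Stricter paired score: a pair counts only if BOTH the standard instance
    and its unanswerable variant are answered correctly, which penalizes models
    that pick the right option without grounding it in the images."""
    correct = sum(
        predictions[p.standard.uid] == p.standard.answer
        and predictions[p.variant.uid] == p.variant.answer
        for p in pairs
    )
    return correct / len(pairs)
```

Under this kind of paired scoring, a model that ignores the images and guesses from textual priors tends to answer the standard and unanswerable versions identically, so its paired score drops even when its plain accuracy looks reasonable.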