MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding
June 13, 2024
Authors: Fei Wang, Xingyu Fu, James Y. Huang, Zekun Li, Qin Liu, Xiaogeng Liu, Mingyu Derek Ma, Nan Xu, Wenxuan Zhou, Kai Zhang, Tianyi Lorena Yan, Wenjie Jacky Mo, Hsiang-Hui Liu, Pan Lu, Chunyuan Li, Chaowei Xiao, Kai-Wei Chang, Dan Roth, Sheng Zhang, Hoifung Poon, Muhao Chen
cs.AI
Abstract
We introduce MuirBench, a comprehensive benchmark that focuses on robust
multi-image understanding capabilities of multimodal LLMs. MuirBench consists
of 12 diverse multi-image tasks (e.g., scene understanding, ordering) that
involve 10 categories of multi-image relations (e.g., multiview, temporal
relations). Comprising 11,264 images and 2,600 multiple-choice questions,
MuirBench is created in a pairwise manner, where each standard instance is
paired with an unanswerable variant that has minimal semantic differences,
enabling reliable assessment. Evaluated on 20 recent multimodal LLMs, our
results reveal that even the best-performing models, such as GPT-4o and Gemini
Pro, find it challenging to solve MuirBench, achieving 68.0% and 49.3%
accuracy, respectively.
Open-source multimodal LLMs trained on single images can hardly generalize to
multi-image questions, hovering below 33.3% in accuracy. These results
highlight the importance of MuirBench in encouraging the community to develop
multimodal LLMs that can look beyond a single image, suggesting potential
pathways for future improvements.
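The pairwise design above pairs each answerable question with a near-identical unanswerable variant, so a model can be credited only when it handles both. The following is a minimal sketch of such a paired scoring scheme; the function name, id convention (`*_unans`), and metric are illustrative assumptions, not the authors' actual evaluation code.

```python
def pairwise_accuracy(predictions, gold):
    """Score a model on paired instances (hypothetical helper).

    predictions/gold map instance ids to chosen options; each standard
    id "x" is assumed to have an unanswerable variant "x_unans".
    A standard instance counts as correct only if the model also
    answers its unanswerable variant correctly.
    """
    standard_ids = [i for i in gold if not i.endswith("_unans")]
    correct = 0
    for i in standard_ids:
        pair = f"{i}_unans"
        if predictions.get(i) == gold[i] and predictions.get(pair) == gold.get(pair):
            correct += 1
    return correct / len(standard_ids)


# Toy example: q1 is answered consistently, q2's unanswerable variant is missed.
preds = {"q1": "B", "q1_unans": "None of the above",
         "q2": "A", "q2_unans": "C"}
gold = {"q1": "B", "q1_unans": "None of the above",
        "q2": "A", "q2_unans": "None of the above"}
print(pairwise_accuracy(preds, gold))  # → 0.5
```

A stricter paired metric like this penalizes models that guess plausibly on standard questions but fail to recognize unanswerable ones, which is the robustness property the benchmark's pairing is designed to probe.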