MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence
May 29, 2025
Authors: Sihan Yang, Runsen Xu, Yiman Xie, Sizhe Yang, Mo Li, Jingli Lin, Chenming Zhu, Xiaochen Chen, Haodong Duan, Xiangyu Yue, Dahua Lin, Tai Wang, Jiangmiao Pang
cs.AI
Abstract
Spatial intelligence is essential for multimodal large language models
(MLLMs) operating in the complex physical world. Existing benchmarks, however,
probe only single-image relations and thus fail to assess the multi-image
spatial reasoning that real-world deployments demand. We introduce MMSI-Bench,
a VQA benchmark dedicated to multi-image spatial intelligence. Six 3D-vision
researchers spent more than 300 hours meticulously crafting 1,000 challenging,
unambiguous multiple-choice questions from over 120,000 images, each paired
with carefully designed distractors and a step-by-step reasoning process. We
conduct extensive experiments and thoroughly evaluate 34 open-source and
proprietary MLLMs, observing a wide gap: the strongest open-source model
attains roughly 30% accuracy and OpenAI's o3 reasoning model reaches 40%, while
humans score 97%. These results underscore the challenging nature of MMSI-Bench
and the substantial headroom for future research. Leveraging the annotated
reasoning processes, we also provide an automated error analysis pipeline that
diagnoses four dominant failure modes: (1) grounding errors, (2)
overlap-matching and scene-reconstruction errors, (3) situation-transformation
reasoning errors, and (4) spatial-logic errors, offering valuable insights for
advancing multi-image spatial intelligence. Project page:
https://runsenxu.com/projects/MMSI_Bench
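
The accuracy figures quoted above are fractions of multiple-choice questions answered correctly. As a rough illustration of that metric only, the sketch below scores predicted option letters against gold letters. The `Question` record, its field names, and the `accuracy` helper are hypothetical and are not taken from the MMSI-Bench release; the actual data schema and evaluation code may differ.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    """Hypothetical record layout, not the benchmark's actual schema."""
    qid: str
    image_paths: tuple[str, ...]  # multi-image input: two or more images
    options: dict[str, str]       # option letter -> option text
    answer: str                   # gold option letter


def accuracy(questions: list[Question], predictions: dict[str, str]) -> float:
    """Return the fraction of questions whose predicted letter matches gold.

    A missing prediction counts as wrong. Comparison ignores case and
    surrounding whitespace.
    """
    if not questions:
        raise ValueError("no questions to score")
    correct = sum(
        1
        for q in questions
        if predictions.get(q.qid, "").strip().upper() == q.answer.strip().upper()
    )
    return correct / len(questions)
```

A model's raw text output would first need to be reduced to a single option letter before calling `accuracy`; how that extraction is done is left open here, since the source does not describe it.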