MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence

Abstract

Spatial intelligence is essential for multimodal large language models (MLLMs) operating in the complex physical world. Existing benchmarks, however, probe only single-image relations and thus fail to assess the multi-image spatial reasoning that real-world deployments demand. We introduce MMSI-Bench, a visual question answering (VQA) benchmark dedicated to multi-image spatial intelligence. Six 3D-vision researchers spent more than 300 hours crafting 1,000 challenging, unambiguous multiple-choice questions from over 120,000 images, each paired with carefully designed distractors and a step-by-step reasoning process. We evaluate 34 open-source and proprietary MLLMs and observe a wide gap: the strongest open-source model attains roughly 30% accuracy and OpenAI's o3 reasoning model reaches 40%, while humans score 97%. These results underscore how challenging MMSI-Bench is and how much headroom remains for future research. Leveraging the annotated reasoning processes, we also provide an automated error analysis pipeline that diagnoses four dominant failure modes: (1) grounding errors, (2) overlap-matching and scene-reconstruction errors, (3) situation-transformation reasoning errors, and (4) spatial-logic errors. This analysis offers valuable insights for advancing multi-image spatial intelligence. Project page: https://runsenxu.com/projects/MMSI_Bench
