MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence

Abstract

Spatial intelligence is essential for multimodal large language models (MLLMs) operating in the complex physical world. Existing benchmarks, however, probe only single-image relations and thus fail to assess the multi-image spatial reasoning that real-world deployments demand. We introduce MMSI-Bench, a VQA benchmark dedicated to multi-image spatial intelligence. Six 3D-vision researchers spent more than 300 hours meticulously crafting 1,000 challenging, unambiguous multiple-choice questions from over 120,000 images, each paired with carefully designed distractors and a step-by-step reasoning process. We conduct extensive experiments and thoroughly evaluate 34 open-source and proprietary MLLMs, observing a wide gap: the strongest open-source model attains roughly 30% accuracy and OpenAI's o3 reasoning model reaches 40%, while humans score 97%. These results underscore the challenging nature of MMSI-Bench and the substantial headroom for future research. Leveraging the annotated reasoning processes, we also provide an automated error analysis pipeline that diagnoses four dominant failure modes: (1) grounding errors, (2) overlap-matching and scene-reconstruction errors, (3) situation-transformation reasoning errors, and (4) spatial-logic errors, offering valuable insights for advancing multi-image spatial intelligence. Project page: https://runsenxu.com/projects/MMSI_Bench
