MMSI-Video-Bench: A Holistic Benchmark for Video-Based Spatial Intelligence
December 11, 2025
Authors: Jingli Lin, Runsen Xu, Shaohao Zhu, Sihan Yang, Peizhou Cao, Yunlong Ran, Miao Hu, Chenming Zhu, Yiman Xie, Yilin Long, Wenbo Hu, Dahua Lin, Tai Wang, Jiangmiao Pang
cs.AI
Abstract
Spatial understanding over continuous visual input is crucial for MLLMs to evolve into general-purpose assistants in physical environments. Yet there is still no comprehensive benchmark that holistically assesses the progress toward this goal. In this work, we introduce MMSI-Video-Bench, a fully human-annotated benchmark for video-based spatial intelligence in MLLMs. It operationalizes a four-level framework, Perception, Planning, Prediction, and Cross-Video Reasoning, through 1,106 questions grounded in 1,278 clips from 25 datasets and in-house videos. Each item is carefully designed and reviewed by 3DV experts with explanatory rationales to ensure precise, unambiguous grounding. Leveraging its diverse data sources and holistic task coverage, MMSI-Video-Bench also supports three domain-oriented sub-benchmarks (Indoor Scene Perception Bench, Robot Bench, and Grounding Bench) for targeted capability assessment. We evaluate 25 strong open-source and proprietary MLLMs, revealing a striking human–AI gap: many models perform near chance, and the best reasoning model lags humans by nearly 60%. We further find that spatially fine-tuned models still fail to generalize effectively on our benchmark. Fine-grained error analysis exposes systematic failures in geometric reasoning, motion grounding, long-horizon prediction, and cross-video correspondence. We also show that typical frame-sampling strategies transfer poorly to our reasoning-intensive benchmark, and that neither 3D spatial cues nor chain-of-thought prompting yields meaningful gains. We expect our benchmark to establish a solid testbed for advancing video-based spatial intelligence.
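To make the evaluation setup concrete, below is a minimal sketch (not taken from the paper) of how one might score an MLLM on a multiple-choice, video-grounded benchmark of this kind using the uniform frame-sampling baseline the abstract refers to. The file name and item fields (`video`, `question`, `options`, `answer`) and the `ask_model` callable are assumptions for illustration, not the authors' released format or API.

```python
# Hypothetical evaluation sketch for a video multiple-choice benchmark.
# Assumes a JSON list of items with fields "video", "question", "options", "answer",
# and a user-supplied `ask_model(frames, prompt)` that returns an answer letter.
import json
import cv2  # pip install opencv-python


def sample_frames(video_path: str, num_frames: int = 16):
    """Uniformly sample `num_frames` RGB frames across the whole clip."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = [int(i * (total - 1) / max(num_frames - 1, 1)) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames


def evaluate(items, ask_model, num_frames: int = 16) -> float:
    """Return multiple-choice accuracy over all items."""
    correct = 0
    for item in items:
        frames = sample_frames(item["video"], num_frames)
        # Format the question and lettered options into a single prompt.
        prompt = item["question"] + "\n" + "\n".join(
            f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(item["options"])
        )
        pred = ask_model(frames, prompt).strip().upper()[:1]
        correct += pred == item["answer"]
    return correct / len(items)


if __name__ == "__main__":
    with open("mmsi_video_bench.json") as f:  # hypothetical file name
        items = json.load(f)
    # A constant-answer baseline, standing in for a real MLLM call.
    print(f"Accuracy: {evaluate(items, ask_model=lambda frames, prompt: 'A'):.3f}")
```

The constant-answer baseline in the `__main__` block approximates the near-chance behavior the abstract reports for many models; swapping in a real MLLM call for `ask_model` gives the corresponding model score under the same uniform-sampling assumption.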