SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks
June 8, 2026
Authors: Hongcheng Gao, Hailong Qu, Jingyi Tang, Jiahao Wang, Zihao Huang, Hengkang Qiao, Shihong Huang, Junming Yang, Yi Li, Hongyixuan Yuan, Wenjie Li, Bohan Zeng, Wenbo Li, Bo Wang, Jianhui Liu, Olive Huang, Haoyang Huang, Wentao Zhang, Guoqing Huang, Nan Duan, Yinpeng Dong
cs.AI
Abstract
Spatial reasoning is a foundational capability for multimodal large language models (MLLMs) to perceive and operate within the physical world. However, existing benchmarks predominantly rely on passive evaluation (e.g., static VQA) or simulator-specific pipelines, failing to assess general interactive spatial understanding. We introduce SpatialWorld, a unified benchmark designed specifically for evaluating the interactive spatial understanding of multimodal agents in complex real-world tasks. Integrating eight heterogeneous simulation backends under a shared, simulator-agnostic protocol, SpatialWorld features 760 human-annotated tasks across diverse domains (e.g., household routines, travel, social collaboration). Agents must solve tasks under vision-only partial observability, actively gathering egocentric visual evidence and expressing decisions via a unified, text-based action interface native to MLLMs. For reliable evaluation, each task includes a human-validated initial state, a reference trajectory, and a terminal-state verifier. Evaluating 15 advanced agents reveals that robust spatial task solving remains challenging: the strongest model, GPT-5, achieves an average task success rate (TSR) of only 17.4%, while the leading open-source model, Qwen-3.5, reaches 14.1%. Further analysis exposes a clear mismatch between task success and execution efficiency, alongside substantial domain-specific performance variations. These bottlenecks in active exploration and long-horizon planning position SpatialWorld as a rigorous testbed for future spatial agents.