SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks

June 8, 2026
Authors: Hongcheng Gao, Hailong Qu, Jingyi Tang, Jiahao Wang, Zihao Huang, Hengkang Qiao, Shihong Huang, Junming Yang, Yi Li, Hongyixuan Yuan, Wenjie Li, Bohan Zeng, Wenbo Li, Bo Wang, Jianhui Liu, Olive Huang, Haoyang Huang, Wentao Zhang, Guoqing Huang, Nan Duan, Yinpeng Dong
cs.AI

Abstract

Spatial reasoning is a foundational capability for multimodal large language models (MLLMs) to perceive and operate within the physical world. However, existing benchmarks predominantly rely on passive evaluation (e.g., static VQA) or simulator-specific pipelines, failing to assess general interactive spatial understanding. We introduce SpatialWorld, a unified benchmark designed specifically for evaluating the interactive spatial understanding of multimodal agents in complex real-world tasks. Integrating eight heterogeneous simulation backends under a shared, simulator-agnostic protocol, SpatialWorld features 760 human-annotated tasks across diverse domains (e.g., household routines, travel, social collaboration). Agents must solve tasks under vision-only partial observability, actively gathering egocentric visual evidence and expressing decisions via a unified, text-based action interface native to MLLMs. For reliable evaluation, each task includes a human-validated initial state, a reference trajectory, and a terminal-state verifier. Evaluating 15 advanced agents reveals that robust spatial task solving remains challenging: the strongest model, GPT-5, achieves an average task success rate (TSR) of only 17.4%, while the leading open-source model, Qwen-3.5, reaches 14.1%. Further analysis exposes a clear mismatch between task success and execution efficiency, alongside substantial domain-specific performance variations. These bottlenecks in active exploration and long-horizon planning position SpatialWorld as a rigorous testbed for future spatial agents.
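
The abstract says each task bundles an initial state, a reference trajectory, and a terminal-state verifier, and that agents are scored by task success rate (TSR). The sketch below illustrates one way such a record and metric could fit together. It is not the benchmark's actual schema or scoring code: the `Task` fields, the dictionary-based state, and the `task_success_rate` helper are all hypothetical, and TSR is assumed here to be the plain fraction of tasks whose verifier accepts the final state.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, str]  # hypothetical: a flat mapping of object -> location


@dataclass
class Task:
    """Hypothetical task record mirroring the three parts named in the abstract."""
    task_id: str
    domain: str                        # e.g. "household", "travel"
    initial_state: State               # human-validated starting state
    reference_trajectory: List[str]    # text actions of one known solution
    verifier: Callable[[State], bool]  # checks the terminal state only


def task_success_rate(tasks: List[Task], final_states: Dict[str, State]) -> float:
    """Fraction of tasks whose verifier accepts the agent's final state.

    Assumption: every task is weighted equally; the paper may average
    differently (for example, per domain or per simulator backend).
    """
    if not tasks:
        return 0.0
    passed = sum(1 for t in tasks if t.verifier(final_states[t.task_id]))
    return passed / len(tasks)


if __name__ == "__main__":
    tasks = [
        Task("t1", "household", {"cup": "table"},
             ["pick cup", "place cup sink"],
             lambda s: s.get("cup") == "sink"),
        Task("t2", "travel", {"agent": "lobby"},
             ["go gate"],
             lambda s: s.get("agent") == "gate"),
    ]
    # Final states an agent might reach: the first task succeeds, the second fails.
    finals = {"t1": {"cup": "sink"}, "t2": {"agent": "lobby"}}
    print(f"TSR = {task_success_rate(tasks, finals):.1%}")
```

Because the verifier inspects only the terminal state, an agent can succeed with a path much longer than the reference trajectory, which is one way the reported mismatch between task success and execution efficiency could arise.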