SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks

Abstract

Spatial reasoning is a foundational capability for multimodal large language models (MLLMs) to perceive and operate within the physical world. However, existing benchmarks predominantly rely on passive evaluation (e.g., static VQA) or simulator-specific pipelines, failing to assess general interactive spatial understanding. We introduce SpatialWorld, a unified benchmark designed specifically for evaluating the interactive spatial understanding of multimodal agents in complex real-world tasks. Integrating eight heterogeneous simulation backends under a shared, simulator-agnostic protocol, SpatialWorld features 760 human-annotated tasks across diverse domains (e.g., household routines, travel, social collaboration). Agents must solve tasks under vision-only partial observability, actively gathering egocentric visual evidence and expressing decisions via a unified, text-based action interface native to MLLMs. For reliable evaluation, each task includes a human-validated initial state, a reference trajectory, and a terminal-state verifier. Evaluating 15 advanced agents reveals that robust spatial task solving remains challenging: the strongest model, GPT-5, achieves an average task success rate (TSR) of only 17.4%, while the leading open-source model, Qwen-3.5, reaches 14.1%. Further analysis exposes a clear mismatch between task success and execution efficiency, alongside substantial domain-specific performance variations. These bottlenecks in active exploration and long-horizon planning position SpatialWorld as a rigorous testbed for future spatial agents.
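
The abstract describes each task as a human-validated initial state, a reference trajectory, and a terminal-state verifier, and reports results as a task success rate (TSR). The Python sketch below is one way such a record and metric could be expressed. It is illustrative only: the `Task` class, its field names, the dictionary-based `State`, and the choice to count a task with no recorded final state as a failure are all assumptions made here, not the benchmark's actual data format or scoring code. The abstract also does not say how the reported average TSR is aggregated across domains or backends, so this sketch simply averages over tasks.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical: a flat symbolic description of the world state.
State = Dict[str, str]


@dataclass
class Task:
    task_id: str
    domain: str
    initial_state: State
    reference_trajectory: List[str]    # text actions, as in a text-based action interface
    verifier: Callable[[State], bool]  # terminal-state check


def task_success_rate(tasks: List[Task], final_states: Dict[str, State]) -> float:
    """Return the percentage of tasks whose final state passes the task's verifier."""
    if not tasks:
        raise ValueError("no tasks to score")
    # Assumption: a task with no recorded final state counts as a failure.
    passed = sum(
        1
        for t in tasks
        if t.task_id in final_states and t.verifier(final_states[t.task_id])
    )
    return 100.0 * passed / len(tasks)


# Toy example with two made-up tasks: the first succeeds, the second does not.
tasks = [
    Task("t1", "household", {"mug": "counter"},
         ["pick mug", "place mug in sink"],
         lambda s: s.get("mug") == "sink"),
    Task("t2", "travel", {"agent": "lobby"},
         ["go to gate"],
         lambda s: s.get("agent") == "gate"),
]
final_states = {"t1": {"mug": "sink"}, "t2": {"agent": "lobby"}}
print(task_success_rate(tasks, final_states))  # 50.0
```

Scoring only the terminal state, rather than comparing against the reference trajectory step by step, lets an agent succeed by any valid route. That is consistent with the abstract treating task success and execution efficiency as separate quantities.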