SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks

Abstract

Spatial reasoning is a foundational capability for multimodal large language models (MLLMs) to perceive and operate within the physical world. However, existing benchmarks predominantly rely on passive evaluation (e.g., static VQA) or simulator-specific pipelines, and so fail to assess general interactive spatial understanding. We introduce SpatialWorld, a unified benchmark designed specifically for evaluating the interactive spatial understanding of multimodal agents in complex real-world tasks. Integrating eight heterogeneous simulation backends under a shared, simulator-agnostic protocol, SpatialWorld features 760 human-annotated tasks across diverse domains (e.g., household routines, travel, social collaboration). Agents must solve tasks under vision-only partial observability, actively gathering egocentric visual evidence and expressing decisions via a unified, text-based action interface native to MLLMs. For reliable evaluation, each task includes a human-validated initial state, a reference trajectory, and a terminal-state verifier. Evaluating 15 advanced agents reveals that robust spatial task solving remains challenging: the strongest model, GPT-5, achieves an average task success rate (TSR) of only 17.4%, while the leading open-source model, Qwen-3.5, reaches 14.1%. Further analysis exposes a clear mismatch between task success and execution efficiency, alongside substantial domain-specific performance variations. These bottlenecks in active exploration and long-horizon planning position SpatialWorld as a rigorous testbed for future spatial agents.
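As a rough illustration of the evaluation setup described in the abstract, the sketch below models a task as an initial state, a reference trajectory, and a terminal-state verifier, and computes TSR as the fraction of tasks whose final state passes the verifier. The `Task` schema, the `State` type, and the `task_success_rate` helper are hypothetical names introduced here for illustration only; the abstract does not specify the benchmark's actual data format or API, nor how the reported average is aggregated.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical state representation; the abstract does not define a schema.
State = Dict[str, object]


@dataclass
class Task:
    """Illustrative task record (assumed fields, not the benchmark's format)."""
    task_id: str
    domain: str                         # e.g. "household", "travel"
    initial_state: State                # human-validated initial state
    reference_trajectory: List[str]     # text actions of a reference solution
    verifier: Callable[[State], bool]   # terminal-state check


def task_success_rate(tasks: List[Task], final_states: Dict[str, State]) -> float:
    """Fraction of tasks whose terminal state passes the task's verifier.

    A task with no recorded final state counts as a failure.
    """
    if not tasks:
        return 0.0
    passed = sum(
        1
        for t in tasks
        if t.task_id in final_states and t.verifier(final_states[t.task_id])
    )
    return passed / len(tasks)
```

In this sketch success depends only on the terminal state, so an agent may take a different route than the reference trajectory and still succeed. That is one way a gap between task success and execution efficiency could be measured separately, for instance by comparing the agent's step count against the length of `reference_trajectory`.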