ESI-Bench: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop

Abstract

Spatial intelligence unfolds through a perception-action loop: agents act to acquire observations, and reason about how observations vary as a function of action. Rather than passively processing what is seen, they actively uncover what is unseen: occluded structure, dynamics, containment, and functionality that cannot be resolved from passive sensing alone. We move beyond prior formulations of spatial intelligence that assume oracle observations by recasting the observer as an actor. We introduce ESI-Bench, a comprehensive benchmark for embodied spatial intelligence built on OmniGibson and grounded in Spelke's core knowledge systems, spanning 10 task categories and 29 subcategories. Agents must decide which abilities to deploy (perception, locomotion, and manipulation) and how to sequence them to actively accumulate task-relevant evidence. We conduct extensive experiments on state-of-the-art multimodal large language models (MLLMs) and find that active exploration substantially outperforms its passive counterparts, with agents spontaneously discovering emergent spatial strategies without explicit instructions, while random multi-view observations often add noise rather than signal despite consuming far more images. Most failures stem not from weak perception but from action blindness: poor action choices lead to poor observations, which in turn drive cascading errors. While explicit 3D grounding stabilizes reasoning on depth-sensitive tasks, imperfect 3D representations prove more harmful than 2D baselines because they distort spatial relations. Human studies further reveal that, unlike humans, who seek falsifying viewpoints and revise their beliefs under contradiction, models commit prematurely with high confidence regardless of evidence quality, exposing a metacognitive gap that neither better perception nor more embodied interaction alone can close.