ESI-Bench: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop

Abstract

Spatial intelligence unfolds through a perception-action loop: agents act to acquire observations and reason about how those observations vary as a function of action. Rather than passively processing what is seen, they actively uncover what is unseen: occluded structure, dynamics, containment, and functionality that cannot be resolved from passive sensing alone. We move beyond prior formulations of spatial intelligence that assume oracle observations by recasting the observer as an actor. We introduce ESI-BENCH, a comprehensive benchmark for embodied spatial intelligence built on OmniGibson and grounded in Spelke's core knowledge systems, spanning 10 task categories and 29 subcategories. Agents must decide which abilities to deploy (perception, locomotion, and manipulation) and how to sequence them to actively accumulate task-relevant evidence. We conduct extensive experiments on state-of-the-art multimodal large language models (MLLMs) and find that active exploration substantially outperforms its passive counterparts, with agents spontaneously discovering emergent spatial strategies without explicit instructions, whereas random multi-view sampling often adds noise rather than signal despite consuming far more images. Most failures stem not from weak perception but from action blindness: poor action choices lead to poor observations, which in turn drive cascading errors. While explicit 3D grounding stabilizes reasoning on depth-sensitive tasks, imperfect 3D representations prove more harmful than 2D baselines because they distort spatial relations. Human studies further reveal that, unlike humans, who seek falsifying viewpoints and revise beliefs under contradiction, models commit prematurely with high confidence regardless of evidence quality, exposing a metacognitive gap that neither better perception nor more embodied interaction alone can close.