ESI-Bench: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop

Abstract

Spatial intelligence unfolds through a perception-action loop: agents act to acquire observations, and reason about how those observations vary as a function of action. Rather than passively processing what is seen, they actively uncover what is unseen: occluded structure, dynamics, containment, and functionality that cannot be resolved from passive sensing alone. We move beyond prior formulations of spatial intelligence that assume oracle observations by recasting the observer as an actor. We introduce ESI-Bench, a comprehensive benchmark for embodied spatial intelligence built on OmniGibson and grounded in Spelke's core knowledge systems, spanning 10 task categories and 29 subcategories. Agents must decide which abilities to deploy (perception, locomotion, and manipulation) and how to sequence them to actively accumulate task-relevant evidence. We conduct extensive experiments on state-of-the-art MLLMs and find that active exploration substantially outperforms its passive counterparts. Agents spontaneously discover emergent spatial strategies without explicit instructions, while random multi-view input often adds noise rather than signal despite consuming far more images. Most failures stem not from weak perception but from action blindness: poor action choices lead to poor observations, which in turn drive cascading errors. Explicit 3D grounding stabilizes reasoning on depth-sensitive tasks, but imperfect 3D representations distort spatial relations and prove more harmful than 2D baselines. Human studies further reveal that, unlike humans, who seek falsifying viewpoints and revise beliefs under contradiction, models commit prematurely with high confidence regardless of evidence quality. This exposes a metacognitive gap that neither better perception nor more embodied interaction alone can close.