Theory of Space: Can Foundation Models Construct Spatial Beliefs through Active Exploration?
February 4, 2026
Authors: Pingyue Zhang, Zihan Huang, Yue Wang, Jieyu Zhang, Letian Xue, Zihan Wang, Qineng Wang, Keshigeyan Chandrasegaran, Ruohan Zhang, Yejin Choi, Ranjay Krishna, Jiajun Wu, Li Fei-Fei, Manling Li
cs.AI
Abstract
Spatial embodied intelligence requires agents to act in order to acquire information under partial observability. While multimodal foundation models excel at passive perception, their capacity for active, self-directed exploration remains understudied. We propose Theory of Space, defined as an agent's ability to acquire information through self-directed, active exploration and to construct, revise, and exploit a spatial belief from sequential, partial observations. We evaluate this ability with a benchmark whose goal is curiosity-driven exploration to build an accurate cognitive map. A key innovation is spatial belief probing, which prompts models to reveal their internal spatial representations at each step. Our evaluation of state-of-the-art models reveals several critical bottlenecks. First, we identify an Active-Passive Gap: performance drops significantly when agents must gather information autonomously. Second, we find that exploration is highly inefficient: models explore unsystematically compared to program-based proxies. Through belief probing, we diagnose that while perception is an initial bottleneck, global beliefs are unstable, which causes spatial knowledge to degrade over time. Finally, using a false belief paradigm, we uncover Belief Inertia, where agents fail to update obsolete priors with new evidence. This issue is present in text-based agents but is particularly severe in vision-based models. Our findings suggest that current foundation models struggle to maintain coherent, revisable spatial beliefs during active exploration.
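To make the idea of per-step belief probing concrete, the following is a minimal toy sketch, not the benchmark's actual implementation. Everything in it is an assumption made for illustration: the `ToyAgent` class, the grid coordinates, the visibility radius, and the exact-match accuracy score are all hypothetical. A real setup would replace `probe` with a prompt that asks the model to report its current map.

```python
from dataclasses import dataclass, field

Pos = tuple[int, int]


@dataclass
class ToyAgent:
    """Hypothetical stand-in for a model: it remembers what it has seen."""
    belief: dict[str, Pos] = field(default_factory=dict)

    def observe(self, visible: dict[str, Pos]) -> None:
        # New evidence overwrites old entries (i.e. no belief inertia here).
        self.belief.update(visible)

    def probe(self) -> dict[str, Pos]:
        # Assumption: with a real model this would be a prompt asking for its map.
        return dict(self.belief)


def visible_objects(world: dict[str, Pos], at: Pos, radius: int = 1) -> dict[str, Pos]:
    """Partial observability: only objects within `radius` cells are seen."""
    return {
        name: p
        for name, p in world.items()
        if max(abs(p[0] - at[0]), abs(p[1] - at[1])) <= radius
    }


def belief_accuracy(belief: dict[str, Pos], world: dict[str, Pos]) -> float:
    """Fraction of objects whose believed position matches the ground truth."""
    correct = sum(1 for name, p in world.items() if belief.get(name) == p)
    return correct / len(world)


def explore_with_probing(world: dict[str, Pos], path: list[Pos], agent: ToyAgent) -> list[float]:
    """Move along `path`, and after every step probe and score the belief."""
    scores = []
    for at in path:
        agent.observe(visible_objects(world, at))
        scores.append(belief_accuracy(agent.probe(), world))
    return scores


# Hypothetical room layout and exploration path.
world = {"chair": (0, 1), "lamp": (3, 3), "table": (5, 0)}
path = [(0, 0), (2, 2), (4, 1)]
scores = explore_with_probing(world, path, ToyAgent())
```

Because the belief is scored after every step rather than only at the end, a curve like `scores` would show where a map stops improving or starts to degrade, which is the kind of diagnosis the abstract attributes to belief probing.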