
Communicating about Space: Language-Mediated Spatial Integration Across Partial Views

March 28, 2026
Authors: Ankur Sikarwar, Debangan Mishra, Sudarshan Nikhil, Ponnurangam Kumaraguru, Aishwarya Agrawal
cs.AI

Abstract

Humans build shared spatial understanding by communicating partial, viewpoint-dependent observations. We ask whether Multimodal Large Language Models (MLLMs) can do the same, aligning distinct egocentric views through dialogue to form a coherent, allocentric mental model of a shared environment. To study this systematically, we introduce COSMIC, a benchmark for Collaborative Spatial Communication. In this setting, two static MLLM agents observe a 3D indoor environment from different viewpoints and exchange natural-language messages to solve spatial queries. COSMIC contains 899 diverse scenes and 1250 question-answer pairs spanning five tasks. We find a consistent capability hierarchy: MLLMs are most reliable at identifying shared anchor objects across views, perform worse on relational reasoning, and largely fail at building globally consistent maps, where even frontier models perform near chance. Moreover, we find that thinking capability yields consistent gains in anchor grounding but is insufficient for higher-level spatial communication. To contextualize model behavior, we additionally collect 250 human-human dialogues. Humans achieve 95% aggregate accuracy, while the best-performing model, Gemini-3-Pro-Thinking, achieves 72%, leaving significant room for improvement. Furthermore, human conversations become increasingly specific as partners converge on a shared mental model, whereas model dialogues continue to explore new possibilities rather than converging, consistent with a limited ability to build and maintain a robust shared mental model. Our code and data are available at https://github.com/ankursikarwar/Cosmic