Communicating about Space: Language-Mediated Spatial Integration Across Partial Views

March 28, 2026
Authors: Ankur Sikarwar, Debangan Mishra, Sudarshan Nikhil, Ponnurangam Kumaraguru, Aishwarya Agrawal
cs.AI

Abstract

Humans build shared spatial understanding by communicating partial, viewpoint-dependent observations. We ask whether Multimodal Large Language Models (MLLMs) can do the same, aligning distinct egocentric views through dialogue to form a coherent, allocentric mental model of a shared environment. To study this systematically, we introduce COSMIC, a benchmark for Collaborative Spatial Communication. In this setting, two static MLLM agents observe a 3D indoor environment from different viewpoints and exchange natural-language messages to solve spatial queries. COSMIC contains 899 diverse scenes and 1250 question-answer pairs spanning five tasks. We find a consistent capability hierarchy: MLLMs are most reliable at identifying shared anchor objects across views, perform worse on relational reasoning, and largely fail at building globally consistent maps, where even frontier models perform near chance. Moreover, we find that thinking capability yields consistent gains in anchor grounding but is insufficient for higher-level spatial communication. To contextualize model behavior, we additionally collect 250 human-human dialogues. Humans achieve 95% aggregate accuracy, while the best-performing model, Gemini-3-Pro-Thinking, achieves only 72%, leaving significant room for improvement. Further analysis shows that human conversations become increasingly specific as partners converge on a shared mental model, whereas model dialogues continue to explore new possibilities rather than converging, consistent with a limited ability to build and maintain a robust shared mental model. Our code and data are available at https://github.com/ankursikarwar/Cosmic