Communicating about Space: Language-Mediated Spatial Integration Across Partial Views

Abstract

Humans build shared spatial understanding by communicating partial, viewpoint-dependent observations. We ask whether Multimodal Large Language Models (MLLMs) can do the same, aligning distinct egocentric views through dialogue to form a coherent, allocentric mental model of a shared environment. To study this systematically, we introduce COSMIC, a benchmark for Collaborative Spatial Communication. In this setting, two static MLLM agents observe a 3D indoor environment from different viewpoints and exchange natural-language messages to solve spatial queries. COSMIC contains 899 diverse scenes and 1250 question-answer pairs spanning five tasks. We find a consistent capability hierarchy: MLLMs are most reliable at identifying shared anchor objects across views, perform worse on relational reasoning, and largely fail at building globally consistent maps, where even frontier models perform near chance. Moreover, we find that thinking capability yields consistent gains in anchor grounding but is insufficient for higher-level spatial communication. To contextualize model behavior, we additionally collect 250 human-human dialogues. Humans achieve 95% aggregate accuracy, whereas even the best-performing model, Gemini-3-Pro-Thinking, achieves only 72%, leaving significant room for improvement. Moreover, human conversations become increasingly specific as partners converge on a shared mental model, whereas model dialogues continue to explore new possibilities rather than converging, consistent with a limited ability to build and maintain a robust shared mental model. Our code and data are available at https://github.com/ankursikarwar/Cosmic
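The two-agent setting described above can be illustrated with a minimal sketch. This is not the COSMIC implementation: the names `Agent`, `run_dialogue`, the `ANSWER:` prefix, the turn budget, and the rule-based stand-in agents are all hypothetical, and each "view" is reduced to a set of object names rather than an image. It only shows the general shape of the protocol, in which two agents with different partial views alternate natural-language messages until one commits to an answer.

```python
from dataclasses import dataclass
from typing import Callable, List, Set, Tuple

# All names here are hypothetical; none come from the COSMIC codebase.
Message = Tuple[str, str]  # (speaker name, message text)


@dataclass
class Agent:
    name: str
    view: Set[str]  # stand-in for an egocentric image: objects visible to this agent
    respond: Callable[[Set[str], str, List[Message]], str]


def run_dialogue(a: Agent, b: Agent, question: str,
                 max_turns: int = 6) -> Tuple[str, List[Message]]:
    """Alternate messages between two agents until one commits to an answer."""
    history: List[Message] = []
    speakers = [a, b]
    for turn in range(max_turns):
        speaker = speakers[turn % 2]
        text = speaker.respond(speaker.view, question, history)
        history.append((speaker.name, text))
        if text.startswith("ANSWER:"):
            return text[len("ANSWER:"):].strip(), history
    return "", history  # no answer within the turn budget


def describe(view: Set[str], question: str, history: List[Message]) -> str:
    """Toy policy: always report the objects in the agent's own view."""
    return "I see: " + ", ".join(sorted(view))


def find_shared(view: Set[str], question: str, history: List[Message]) -> str:
    """Toy policy: once the partner has described its view, name a shared object."""
    if not history:
        return describe(view, question, history)
    partner_text = history[-1][1]
    partner_objects = set(partner_text[len("I see: "):].split(", "))
    shared = sorted(view & partner_objects)
    return "ANSWER: " + (shared[0] if shared else "none")


agent_a = Agent("A", {"sofa", "lamp"}, describe)
agent_b = Agent("B", {"lamp", "table"}, find_shared)
answer, history = run_dialogue(agent_a, agent_b,
                               "Which object is visible in both views?")
print(answer)
```

In an actual evaluation the `respond` callables would wrap MLLM calls that take an image and the dialogue history; the rule-based functions here exist only so the loop can run without a model.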
