Communicating about Space: Language-Mediated Spatial Integration Across Partial Views

Abstract

Humans build shared spatial understanding by communicating partial, viewpoint-dependent observations. We ask whether Multimodal Large Language Models (MLLMs) can do the same, aligning distinct egocentric views through dialogue to form a coherent, allocentric mental model of a shared environment. To study this systematically, we introduce COSMIC, a benchmark for Collaborative Spatial Communication. In this setting, two static MLLM agents observe a 3D indoor environment from different viewpoints and exchange natural-language messages to solve spatial queries. COSMIC contains 899 diverse scenes and 1250 question-answer pairs spanning five tasks. We find a consistent capability hierarchy: MLLMs are most reliable at identifying shared anchor objects across views, perform worse on relational reasoning, and largely fail at building globally consistent maps, where even frontier models perform near chance. Moreover, we find that thinking capability yields consistent gains in anchor grounding but is insufficient for higher-level spatial communication. To contextualize model behavior, we additionally collect 250 human-human dialogues. Humans achieve 95% aggregate accuracy, leaving significant room for improvement even for the best-performing model, Gemini-3-Pro-Thinking, which achieves 72% aggregate accuracy. In addition, human conversations become increasingly specific as partners converge on a shared mental model, whereas model dialogues continue to explore new possibilities rather than converging, consistent with a limited ability to build and maintain a robust shared mental model. Our code and data are available at https://github.com/ankursikarwar/Cosmic
