COOPER: A Unified Model for Cooperative Perception and Reasoning in Spatial Intelligence
December 4, 2025
Authors: Zefeng Zhang, Xiangzhao Hao, Hengzhu Tang, Zhenyu Zhang, Jiawei Sheng, Xiaodong Li, Zhenyang Li, Li Gao, Daiting Shi, Dawei Yin, Tingwen Liu
cs.AI
Abstract
Visual Spatial Reasoning is crucial for enabling Multimodal Large Language Models (MLLMs) to understand object properties and spatial relationships, yet current models still struggle with 3D-aware reasoning. Existing approaches typically enhance either perception, by augmenting RGB inputs with auxiliary modalities such as depth and segmentation, or reasoning, by training on spatial VQA datasets and applying reinforcement learning, and thus treat these two aspects in isolation. In this work, we investigate whether a unified MLLM can develop an intrinsic ability to enhance spatial perception and, through adaptive interleaved reasoning, achieve stronger spatial intelligence. We propose COOPER, a unified MLLM that leverages depth and segmentation as auxiliary modalities and is trained in two stages to acquire auxiliary modality generation and adaptive, interleaved reasoning capabilities. COOPER achieves an average 6.91% improvement in spatial reasoning while maintaining general performance. Moreover, even a variant trained only for auxiliary modality generation attains a 7.92% gain on distance and size estimation, suggesting that learning to generate auxiliary modalities helps internalize spatial knowledge and strengthen spatial understanding.
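To make the abstract's notion of "adaptive interleaved reasoning" concrete, the following is a minimal, hypothetical sketch of the control flow it describes: at each step the model either generates an auxiliary-modality view (depth or segmentation) or continues text reasoning. This is not the authors' implementation; all function names, the keyword-based decision policy, and the stub outputs are illustrative assumptions standing in for the learned model components.

```python
# Illustrative sketch only, NOT the COOPER codebase. It assumes a hypothetical
# interface where a unified MLLM can either emit a reasoning-text step or
# generate an auxiliary-modality view (depth / segmentation) mid-reasoning.

from dataclasses import dataclass, field
from typing import List


@dataclass
class ReasoningState:
    question: str
    steps: List[str] = field(default_factory=list)  # interleaved text + modality steps


def needs_auxiliary_view(state: ReasoningState) -> bool:
    """Hypothetical policy: request a depth/segmentation view when the question
    mentions metric or relational cues and no view has been generated yet."""
    spatial_cues = ("distance", "closer", "size", "behind", "left", "right")
    has_view = any(s.startswith("[AUX]") for s in state.steps)
    return (not has_view) and any(c in state.question.lower() for c in spatial_cues)


def generate_auxiliary_view(state: ReasoningState) -> str:
    """Stand-in for the learned auxiliary-modality generation (training stage 1)."""
    return "[AUX] depth + segmentation view of the scene"


def reasoning_step(state: ReasoningState) -> str:
    """Stand-in for one text-reasoning step conditioned on prior steps (stage 2)."""
    return f"[TEXT] step {len(state.steps) + 1}: compare objects using available views"


def interleaved_answer(question: str, max_steps: int = 4) -> ReasoningState:
    """Adaptive interleaved loop: at each step, choose to perceive (generate an
    auxiliary view) or to reason in text, as the abstract describes."""
    state = ReasoningState(question=question)
    for _ in range(max_steps):
        if needs_auxiliary_view(state):
            state.steps.append(generate_auxiliary_view(state))
        else:
            state.steps.append(reasoning_step(state))
    return state


if __name__ == "__main__":
    result = interleaved_answer("Which chair is closer to the window?")
    print("\n".join(result.steps))
```

In this toy loop the "perceive or reason" decision is a keyword heuristic; in the paper's framing that choice is learned, which is what makes the interleaving adaptive rather than fixed.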