Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence
May 29, 2025
Authors: Diankun Wu, Fangfu Liu, Yi-Hsin Hung, Yueqi Duan
cs.AI
Abstract
Recent advancements in Multimodal Large Language Models (MLLMs) have
significantly enhanced performance on 2D visual tasks. However, improving their
spatial intelligence remains a challenge. Existing 3D MLLMs typically rely on
additional 3D or 2.5D data to incorporate spatial awareness, restricting their
utility in scenarios with only 2D inputs, such as images or videos. In this
paper, we present Spatial-MLLM, a novel framework for visual-based spatial
reasoning from purely 2D observations. Unlike conventional video MLLMs, which
rely on CLIP-based visual encoders optimized for semantic understanding, our
key insight is to unleash the strong structural prior of the feed-forward
visual geometry foundation model. Specifically, we propose a dual-encoder
architecture: a pretrained 2D visual encoder to extract semantic features, and
a spatial encoder, initialized from the backbone of the visual geometry model, to
extract 3D structure features. A connector then integrates both features into
unified visual tokens for enhanced spatial understanding. Furthermore, we
propose a space-aware frame sampling strategy at inference time, which selects
spatially informative frames from a video sequence, ensuring that even under a
limited token length, the model focuses on frames critical for spatial
reasoning. Beyond architectural improvements, we construct the Spatial-MLLM-120k
dataset and train the model on it using supervised fine-tuning and GRPO.
Extensive experiments on various real-world datasets demonstrate that our
Spatial-MLLM achieves state-of-the-art performance on a wide range of
visual-based spatial understanding and reasoning tasks. Project page:
https://diankun-wu.github.io/Spatial-MLLM/.
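The dual-encoder design described in the abstract can be pictured with a short sketch. The following is a minimal PyTorch illustration, not the paper's implementation: the encoder modules, token shapes, and the MLP connector are all assumptions standing in for the actual 2D visual encoder, visual-geometry backbone, and connector.

```python
# Minimal sketch of the dual-encoder + connector idea (illustrative only).
# `semantic_encoder`, `spatial_encoder`, and all dimensions are hypothetical
# stand-ins for the paper's actual components.
import torch
import torch.nn as nn

class DualEncoder(nn.Module):
    def __init__(self, semantic_encoder: nn.Module, spatial_encoder: nn.Module,
                 sem_dim: int, spa_dim: int, llm_dim: int):
        super().__init__()
        self.semantic_encoder = semantic_encoder  # pretrained 2D visual encoder (semantics)
        self.spatial_encoder = spatial_encoder    # initialized from a visual-geometry backbone (3D structure)
        # Connector: fuses the two token streams into unified visual tokens
        # at the LLM's hidden size (a simple MLP here, as an assumption).
        self.connector = nn.Sequential(
            nn.Linear(sem_dim + spa_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, C, H, W) sampled video frames.
        # Both encoders are assumed to emit aligned token grids of length N.
        sem = self.semantic_encoder(frames)    # (B, N, sem_dim) semantic tokens
        spa = self.spatial_encoder(frames)     # (B, N, spa_dim) 3D-structure tokens
        fused = torch.cat([sem, spa], dim=-1)  # concatenate per-token features
        return self.connector(fused)           # (B, N, llm_dim) unified visual tokens
```

In a standard video-MLLM pipeline, the unified visual tokens produced this way would then typically be fed to the language model alongside the text tokens.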
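The space-aware frame sampling step can likewise be sketched. One plausible reading is a greedy farthest-point selection over per-frame spatial features, so that a limited token budget covers spatially diverse views; the scoring below is an assumption, not necessarily the paper's actual criterion.

```python
# Illustrative space-aware frame sampling: greedy farthest-point selection
# over per-frame spatial features. The selection criterion is an assumption;
# the paper's actual strategy may score frames differently.
import torch
import torch.nn.functional as F

def space_aware_sample(frame_feats: torch.Tensor, k: int) -> list[int]:
    """frame_feats: (T, D) per-frame spatial features; returns k frame indices."""
    feats = F.normalize(frame_feats, dim=-1)  # unit-norm features for cosine similarity
    selected = [0]                            # seed with the first frame
    max_sim = feats @ feats[0]                # each frame's similarity to the selected set
    for _ in range(k - 1):
        idx = int(torch.argmin(max_sim))      # least similar = most new spatial information
        selected.append(idx)
        max_sim = torch.maximum(max_sim, feats @ feats[idx])
    return sorted(selected)
```

The per-frame features here could plausibly come from the same visual-geometry backbone that initializes the spatial encoder, though that pairing is also an assumption.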