Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence

May 29, 2025
作者: Diankun Wu, Fangfu Liu, Yi-Hsin Hung, Yueqi Duan
cs.AI

Abstract

Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced performance on 2D visual tasks. However, improving their spatial intelligence remains a challenge. Existing 3D MLLMs typically rely on additional 3D or 2.5D data to incorporate spatial awareness, restricting their utility in scenarios with only 2D inputs, such as images or videos. In this paper, we present Spatial-MLLM, a novel framework for visual-based spatial reasoning from purely 2D observations. Unlike conventional video MLLMs, which rely on CLIP-based visual encoders optimized for semantic understanding, our key insight is to unleash the strong structural prior of a feed-forward visual geometry foundation model. Specifically, we propose a dual-encoder architecture: a pretrained 2D visual encoder to extract semantic features, and a spatial encoder, initialized from the backbone of the visual geometry model, to extract 3D structure features. A connector then integrates both features into unified visual tokens for enhanced spatial understanding. Furthermore, we propose a space-aware frame sampling strategy at inference time, which selects spatially informative frames from a video sequence, ensuring that even under a limited token length the model focuses on frames critical for spatial reasoning. Beyond the architectural improvements, we construct the Spatial-MLLM-120k dataset and train the model on it using supervised fine-tuning and Group Relative Policy Optimization (GRPO). Extensive experiments on various real-world datasets demonstrate that our Spatial-MLLM achieves state-of-the-art performance in a wide range of visual-based spatial understanding and reasoning tasks. Project page: https://diankun-wu.github.io/Spatial-MLLM/.
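
To make the dual-encoder design concrete, the following is a minimal PyTorch sketch of the idea described above: a semantic stream and a 3D-structure stream are computed per patch and fused by a connector into unified visual tokens. All class, parameter, and dimension names here are illustrative assumptions, not the authors' implementation, and the two encoders are reduced to linear stubs so the sketch runs on its own.

```python
import torch
import torch.nn as nn

class DualEncoderConnector(nn.Module):
    """Hedged sketch of the dual-encoder idea (all names are placeholders).

    In the actual model, `semantic_encoder` would be a pretrained 2D visual
    encoder (CLIP-style) and `spatial_encoder` a backbone initialized from a
    feed-forward visual geometry model; linear stubs keep this self-contained.
    """
    def __init__(self, patch_dim=768, d_sem=1024, d_geo=768, d_llm=4096):
        super().__init__()
        self.semantic_encoder = nn.Linear(patch_dim, d_sem)  # semantic stub
        self.spatial_encoder = nn.Linear(patch_dim, d_geo)   # 3D-structure stub
        # Connector: fuse both feature streams into unified visual tokens.
        self.connector = nn.Sequential(
            nn.Linear(d_sem + d_geo, d_llm),
            nn.GELU(),
            nn.Linear(d_llm, d_llm),
        )

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (B, N, patch_dim) patch embeddings from purely 2D frames
        sem = self.semantic_encoder(patches)   # semantic features
        geo = self.spatial_encoder(patches)    # 3D structure features
        fused = torch.cat([sem, geo], dim=-1)  # per-patch concatenation
        return self.connector(fused)           # (B, N, d_llm) unified tokens

tokens = DualEncoderConnector()(torch.randn(1, 196, 768))
print(tokens.shape)  # torch.Size([1, 196, 4096])
```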
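
The space-aware frame sampling strategy can be viewed as a coverage-selection problem over the scene. Below is a hedged sketch under the assumption that each candidate frame is scored by the set of coarse 3D cells (e.g. voxels back-projected by the geometry branch) it observes; the greedy loop, the `frame_cells` representation, and the `budget` parameter are illustrative choices, not the paper's exact algorithm.

```python
def sample_space_aware_frames(frame_cells: list[set], budget: int) -> list[int]:
    """Greedy maximum-coverage frame selection.

    frame_cells[i] is the set of 3D cell ids visible in frame i; pick up to
    `budget` frames that jointly cover as many distinct cells as possible.
    """
    covered: set = set()
    chosen: list[int] = []
    remaining = set(range(len(frame_cells)))
    while remaining and len(chosen) < budget:
        # Select the frame that adds the most not-yet-covered cells.
        best = max(remaining, key=lambda i: len(frame_cells[i] - covered))
        if not frame_cells[best] - covered:
            break  # no remaining frame adds new spatial information
        chosen.append(best)
        covered |= frame_cells[best]
        remaining.remove(best)
    return sorted(chosen)

# With a budget of 2 frames, frames 0 and 2 jointly cover the most cells.
print(sample_space_aware_frames([{1, 2, 3, 4}, {2, 3}, {4, 5, 6}], budget=2))  # [0, 2]
```

Greedy selection is the standard heuristic for maximum coverage (it carries a 1 - 1/e approximation guarantee), which makes it a cheap inference-time choice when the token length caps how many frames the model can ingest.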
