Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence

Abstract

Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced performance on 2D visual tasks. However, improving their spatial intelligence remains a challenge. Existing 3D MLLMs rely on additional 3D or 2.5D data to incorporate spatial awareness, which restricts their utility in scenarios with only 2D inputs, such as images or videos. In this paper, we present Spatial-MLLM, a novel framework for visual-based spatial reasoning from purely 2D observations. Unlike conventional video MLLMs, which rely on CLIP-based visual encoders optimized for semantic understanding, our key insight is to unleash the strong structure prior from a feed-forward visual geometry foundation model. Specifically, we propose a dual-encoder architecture: a pretrained 2D visual encoder to extract semantic features, and a spatial encoder, initialized from the backbone of the visual geometry model, to extract 3D structure features. A connector then integrates both features into unified visual tokens for enhanced spatial understanding. Furthermore, we propose a space-aware frame sampling strategy at inference time, which selects the spatially informative frames of a video sequence, ensuring that even under limited token length, the model focuses on frames critical for spatial reasoning. Beyond architecture improvements, we construct the Spatial-MLLM-120k dataset and train the model on it using supervised fine-tuning and GRPO. Extensive experiments on various real-world datasets demonstrate that our Spatial-MLLM achieves state-of-the-art performance in a wide range of visual-based spatial understanding and reasoning tasks. Project page: https://diankun-wu.github.io/Spatial-MLLM/
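The abstract does not spell out how space-aware frame sampling decides which frames are spatially informative. As a rough illustration only, the sketch below treats it as a greedy maximum-coverage problem: each frame is assumed to come with a set of scene-cell ids it observes, and frames are picked one at a time by how many not-yet-covered cells they add. The function name `select_frames`, the `coverage` mapping, and the notion of scene cells are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Set


def select_frames(coverage: Dict[int, Set[int]], budget: int) -> List[int]:
    """Greedily pick up to `budget` frames that maximize covered scene cells.

    `coverage` maps a frame index to the set of (hypothetical) scene-cell ids
    visible in that frame. How such cells are derived is out of scope here.
    """
    covered: Set[int] = set()
    chosen: List[int] = []
    remaining = dict(coverage)
    while remaining and len(chosen) < budget:
        # Frame adding the most not-yet-covered cells; ties go to the earliest frame.
        best = max(sorted(remaining), key=lambda f: len(remaining[f] - covered))
        if not remaining[best] - covered:
            break  # no remaining frame adds new information
        chosen.append(best)
        covered |= remaining.pop(best)
    return sorted(chosen)  # return frames in temporal order


# Toy input (assumed): four frames observing overlapping cells.
frames = {
    0: {1, 2, 3},
    1: {2, 3},
    2: {4, 5, 6, 7},
    3: {1, 7},
}
selected = select_frames(frames, budget=2)
print(selected)
```

With a budget of two frames, the sketch keeps the two frames that together observe the most distinct cells and skips frames whose content is already covered, which is the behavior one would want when the token length is limited.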
