Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models

Abstract

Multi-modal large language models (MLLMs) have rapidly advanced in visual tasks, yet their spatial understanding remains limited to single images, leaving them ill-suited for robotics and other real-world applications that require multi-frame reasoning. In this paper, we propose a framework to equip MLLMs with robust multi-frame spatial understanding by integrating depth perception, visual correspondence, and dynamic perception. Central to our approach is the MultiSPA dataset, a novel, large-scale collection of more than 27 million samples spanning diverse 3D and 4D scenes. Alongside MultiSPA, we introduce a comprehensive benchmark that tests a wide spectrum of spatial tasks under uniform metrics. Our resulting model, Multi-SpatialMLLM, achieves significant gains over baselines and proprietary systems, demonstrating scalable, generalizable multi-frame reasoning. We further observe multi-task benefits and early indications of emergent capabilities in challenging scenarios, and showcase how our model can serve as a multi-frame reward annotator for robotics.

