Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models

May 22, 2025
作者: Runsen Xu, Weiyao Wang, Hao Tang, Xingyu Chen, Xiaodong Wang, Fu-Jen Chu, Dahua Lin, Matt Feiszli, Kevin J. Liang
cs.AI

Abstract

Multi-modal large language models (MLLMs) have rapidly advanced in visual tasks, yet their spatial understanding remains limited to single images, leaving them ill-suited for robotics and other real-world applications that require multi-frame reasoning. In this paper, we propose a framework to equip MLLMs with robust multi-frame spatial understanding by integrating depth perception, visual correspondence, and dynamic perception. Central to our approach is the MultiSPA dataset, a novel, large-scale collection of more than 27 million samples spanning diverse 3D and 4D scenes. Alongside MultiSPA, we introduce a comprehensive benchmark that tests a wide spectrum of spatial tasks under uniform metrics. Our resulting model, Multi-SpatialMLLM, achieves significant gains over baselines and proprietary systems, demonstrating scalable, generalizable multi-frame reasoning. We further observe multi-task benefits and early indications of emergent capabilities in challenging scenarios, and showcase how our model can serve as a multi-frame reward annotator for robotics.
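To make the multi-frame setting concrete, the sketch below shows one way such spatial question-answer samples could be organized in code. It is purely illustrative: the field names, task labels, and example questions are assumptions for exposition and are not taken from the MultiSPA release or the paper.

```python
# Illustrative sketch only: a hypothetical layout for multi-frame spatial QA
# samples in the spirit of the tasks described above (depth perception,
# visual correspondence, and motion/dynamics across frames). Field names
# and values are assumptions, not the actual MultiSPA schema.
from dataclasses import dataclass
from typing import List


@dataclass
class MultiFrameSpatialSample:
    frame_paths: List[str]   # two or more RGB frames of the same scene
    task: str                # e.g. "depth", "correspondence", "motion"
    question: str            # natural-language spatial query over the frames
    answer: str              # ground-truth answer used for training/evaluation


# Hand-written examples of the kinds of questions multi-frame spatial
# understanding targets (hypothetical, for illustration only).
samples = [
    MultiFrameSpatialSample(
        frame_paths=["scene01/frame_000.png", "scene01/frame_010.png"],
        task="correspondence",
        question="Which point in frame 2 corresponds to the marked point in frame 1?",
        answer="the corner of the table near the window",
    ),
    MultiFrameSpatialSample(
        frame_paths=["scene02/frame_000.png", "scene02/frame_025.png"],
        task="motion",
        question="Roughly how far did the camera move between the two frames, in meters?",
        answer="approximately 0.6 m",
    ),
]

for s in samples:
    print(f"[{s.task}] {s.question} -> {s.answer}")
```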
