
Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models

May 22, 2025
作者: Runsen Xu, Weiyao Wang, Hao Tang, Xingyu Chen, Xiaodong Wang, Fu-Jen Chu, Dahua Lin, Matt Feiszli, Kevin J. Liang
cs.AI

Abstract

Multi-modal large language models (MLLMs) have rapidly advanced in visual tasks, yet their spatial understanding remains limited to single images, leaving them ill-suited for robotics and other real-world applications that require multi-frame reasoning. In this paper, we propose a framework to equip MLLMs with robust multi-frame spatial understanding by integrating depth perception, visual correspondence, and dynamic perception. Central to our approach is the MultiSPA dataset, a novel, large-scale collection of more than 27 million samples spanning diverse 3D and 4D scenes. Alongside MultiSPA, we introduce a comprehensive benchmark that tests a wide spectrum of spatial tasks under uniform metrics. Our resulting model, Multi-SpatialMLLM, achieves significant gains over baselines and proprietary systems, demonstrating scalable, generalizable multi-frame reasoning. We further observe multi-task benefits and early indications of emergent capabilities in challenging scenarios, and showcase how our model can serve as a multi-frame reward annotator for robotics.
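The abstract notes that the model can serve as a multi-frame reward annotator for robotics. Below is a minimal, hypothetical sketch of what such a use might look like: two or more frames from a robot episode are paired with a progress-rating prompt and sent to a multi-frame MLLM. The prompt wording, the `query_mllm` helper, and the file paths are illustrative assumptions, not the paper's actual interface.

```python
# Hypothetical sketch: querying a multi-frame MLLM as a reward annotator for a
# robot manipulation episode. Prompt format and query_mllm are assumptions,
# not the interface described in the paper.

from typing import List


def build_reward_prompt(task: str, num_frames: int) -> str:
    """Compose a multi-frame prompt asking the model to rate task progress."""
    frame_refs = ", ".join(f"<frame {i + 1}>" for i in range(num_frames))
    return (
        f"You are given {num_frames} frames ({frame_refs}) from a robot executing: '{task}'.\n"
        "Considering object positions and motion across the frames, rate the task "
        "progress from 0 (no progress) to 1 (completed). Answer with a single number."
    )


def query_mllm(prompt: str, frame_paths: List[str]) -> str:
    """Placeholder for a call to a multi-frame MLLM inference backend.

    In practice this would pass the prompt together with the image frames to the
    model and return its text response.
    """
    raise NotImplementedError("Wire this to your MLLM inference backend.")


if __name__ == "__main__":
    frames = ["episode/frame_000.png", "episode/frame_010.png", "episode/frame_020.png"]
    prompt = build_reward_prompt("pick up the red block and place it in the bowl", len(frames))
    # reward = float(query_mllm(prompt, frames))  # e.g., 0.6 -> partial progress
    print(prompt)
```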
