M3: 3D-Spatial MultiModal Memory

March 20, 2025
Authors: Xueyan Zou, Yuchen Song, Ri-Zhao Qiu, Xuanbin Peng, Jianglong Ye, Sifei Liu, Xiaolong Wang
cs.AI

Abstract

We present 3D Spatial MultiModal Memory (M3), a multimodal memory system designed to retain information about medium-sized static scenes through video sources for visual perception. By integrating 3D Gaussian Splatting techniques with foundation models, M3 builds a multimodal memory capable of rendering feature representations across granularities, encompassing a wide range of knowledge. In our exploration, we identify two key challenges in previous works on feature splatting: (1) computational constraints in storing high-dimensional features for each Gaussian primitive, and (2) misalignment or information loss between distilled features and foundation model features. To address these challenges, M3 introduces two key components, principal scene components and Gaussian memory attention, which enable efficient training and inference. To validate M3, we conduct comprehensive quantitative evaluations of feature similarity and downstream tasks, as well as qualitative visualizations to highlight the pixel trace of Gaussian memory attention. Our approach encompasses a diverse range of foundation models, including vision-language models (VLMs), perception models, and large multimodal and language models (LMMs/LLMs). Furthermore, to demonstrate real-world applicability, we deploy M3's feature field in indoor scenes on a quadruped robot. Notably, we claim that M3 is the first work to address the core compression challenges in 3D feature distillation.
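
The storage challenge in (1) and the general idea of reading high-dimensional features out of a small shared memory can be illustrated with a rough sketch. Everything below is an assumption made for illustration, not the paper's implementation: the function names, the dimensions (1,000,000 Gaussians, 1024-dimensional features, 16-dimensional queries, a 2,000-entry bank), and the plain dot-product softmax attention are all hypothetical stand-ins for whatever M3 actually uses.

```python
import math

def storage_floats(num_gaussians, per_gaussian_dim, bank_size=0, bank_dim=0):
    """Count stored floats: per-Gaussian vectors plus an optional shared bank.
    All sizes passed in are hypothetical, chosen only for illustration."""
    return num_gaussians * per_gaussian_dim + bank_size * bank_dim

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def memory_readout(query, keys, memory):
    """Attend from one low-dimensional query over a small memory bank.

    query:  d floats (stand-in for a rendered per-pixel query)
    keys:   K rows of d floats (stand-in for memory entries projected
            into the query space)
    memory: K rows of D floats (stand-in for stored high-dim features)
    Returns a D-dimensional weighted average of the memory rows.
    """
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(memory[0])
    return [sum(w * row[j] for w, row in zip(weights, memory))
            for j in range(dim)]

# Naive: every Gaussian carries a full 1024-dim feature (hypothetical sizes).
naive = storage_floats(1_000_000, 1024)
# Compressed: 16-dim query per Gaussian plus a shared 2000 x 1024 bank.
compressed = storage_floats(1_000_000, 16, bank_size=2000, bank_dim=1024)

# Toy readout: the query aligns with the first key, so the output is
# dominated by the first memory row.
feature = memory_readout([10.0, 0.0],
                         [[1.0, 0.0], [0.0, 1.0]],
                         [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
```

Under these assumed sizes, the per-Gaussian cost drops from 1024 floats to 16, and the shared bank adds only a fixed overhead that does not grow with the number of Gaussians.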
