M3: 3D-Spatial MultiModal Memory
March 20, 2025
Authors: Xueyan Zou, Yuchen Song, Ri-Zhao Qiu, Xuanbin Peng, Jianglong Ye, Sifei Liu, Xiaolong Wang
cs.AI
Abstract
We present 3D Spatial MultiModal Memory (M3), a multimodal memory system
designed to retain information about medium-sized static scenes through video
sources for visual perception. By integrating 3D Gaussian Splatting techniques
with foundation models, M3 builds a multimodal memory capable of rendering
feature representations across granularities, encompassing a wide range of
knowledge. In our exploration, we identify two key challenges in previous works
on feature splatting: (1) computational constraints in storing high-dimensional
features for each Gaussian primitive, and (2) misalignment or information loss
between distilled features and foundation model features. To address these
challenges, we propose M3, whose key components, principal scene components and
Gaussian memory attention, enable efficient training and inference. To
validate M3, we conduct comprehensive quantitative evaluations of feature
similarity and downstream tasks, as well as qualitative visualizations to
highlight the pixel trace of Gaussian memory attention. Our approach
encompasses a diverse range of foundation models, including vision-language
models (VLMs), perception models, and large multimodal and language models
(LMMs/LLMs). Furthermore, to demonstrate real-world applicability, we deploy
M3's feature field in indoor scenes on a quadruped robot. Notably, we claim
that M3 is the first work to address the core compression challenges in 3D
feature distillation.