M3: 3D-Spatial MultiModal Memory

Abstract

We present 3D Spatial MultiModal Memory (M3), a multimodal memory system designed to retain information about medium-sized static scenes perceived visually from video sources. By integrating 3D Gaussian Splatting with foundation models, M3 builds a multimodal memory that can render feature representations across granularities and covers a wide range of knowledge. We identify two key challenges in previous work on feature splatting: (1) computational constraints in storing high-dimensional features for each Gaussian primitive, and (2) misalignment or information loss between distilled features and foundation model features. To address these challenges, we propose M3, whose key components are principal scene components and Gaussian memory attention, enabling efficient training and inference. To validate M3, we conduct comprehensive quantitative evaluations of feature similarity and downstream tasks, together with qualitative visualizations that highlight the pixel trace of Gaussian memory attention. Our approach covers a diverse range of foundation models, including vision-language models (VLMs), perception models, and large multimodal and language models (LMMs/LLMs). To demonstrate real-world applicability, we also deploy M3's feature field on a quadruped robot in indoor scenes. Notably, we claim that M3 is the first work to address the core compression challenges in 3D feature distillation.
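
The abstract names principal scene components and Gaussian memory attention without describing how they work. As a rough illustration only, the sketch below assumes that each rendered pixel carries a low-dimensional query and that the scene keeps a small shared memory bank of high-dimensional vectors; an attention step then turns the query into a full-size feature, so that high-dimensional features need not be stored per Gaussian primitive. The function names, the projection matrix `proj`, and all dimensions are hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def memory_attention(query, proj, memory):
    """Attend from one low-dimensional query to a shared memory bank.

    query:  list of d_q floats (assumed: a rendered per-pixel query)
    proj:   d_q x d_f matrix as a list of rows (hypothetical projection)
    memory: list of n vectors of length d_f (assumed: the memory bank)
    Returns (attention weights over memory entries, d_f-dim feature).
    """
    d_f = len(memory[0])
    # Lift the query into the feature space of the memory bank.
    lifted = [sum(q * row[j] for q, row in zip(query, proj)) for j in range(d_f)]
    # Dot-product similarity between the lifted query and every memory entry.
    scores = [sum(a * b for a, b in zip(lifted, entry)) for entry in memory]
    weights = softmax(scores)
    # The output feature is the attention-weighted sum of memory entries.
    feature = [sum(w * entry[j] for w, entry in zip(weights, memory))
               for j in range(d_f)]
    return weights, feature


# Toy example: 2-dim query, 4-dim features, 3 memory entries.
memory = [[1.0, 0.0, 0.0, 0.0],
          [0.0, 1.0, 0.0, 0.0],
          [0.0, 0.0, 1.0, 0.0]]
proj = [[1.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0]]
weights, feature = memory_attention([5.0, 0.0], proj, memory)
print(weights)
print(feature)
```

In this toy setting the query aligns with the first memory entry, so that entry receives the largest attention weight. The storage argument is that only the short query has to be kept per primitive, while the wide vectors live once in the shared bank.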
