MosaicMem: Hybrid Spatial Memory for Controllable Video World Models

Abstract

Video diffusion models are moving beyond short, plausible clips toward world simulators that must remain consistent under camera motion, revisits, and intervention. Yet spatial memory remains a key bottleneck: explicit 3D structures can improve reprojection-based consistency but struggle to depict moving objects, while implicit memory often produces inaccurate camera motion even with correct poses. We propose Mosaic Memory (MosaicMem), a hybrid spatial memory that lifts patches into 3D for reliable localization and targeted retrieval, while exploiting the model's native conditioning to preserve prompt-following generation. MosaicMem composes spatially aligned patches in the queried view via a patch-and-compose interface, preserving what should persist while allowing the model to inpaint what should evolve. With PRoPE camera conditioning and two new memory alignment methods, experiments show improved pose adherence compared to implicit memory and stronger dynamic modeling than explicit baselines. MosaicMem further enables minute-level navigation, memory-based scene editing, and autoregressive rollout.
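The geometric idea behind lifting patches into 3D and retrieving them for a queried view can be illustrated with a small sketch. This is not the authors' implementation: it assumes a plain pinhole camera, one known depth value per patch center, and a dictionary as the memory store. The names `Camera`, `lift_patch`, `project_point`, and `retrieve_patches` are hypothetical. PRoPE conditioning, the two memory alignment methods, and the diffusion model itself are outside its scope.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Tuple[Vec3, Vec3, Vec3]
IDENTITY: Mat3 = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))


@dataclass(frozen=True)
class Camera:
    """Hypothetical pinhole camera: intrinsics, image size, camera-to-world pose."""
    fx: float
    fy: float
    cx: float
    cy: float
    width: int
    height: int
    R: Mat3 = IDENTITY          # camera-to-world rotation
    t: Vec3 = (0.0, 0.0, 0.0)   # camera center in world coordinates


def lift_patch(cam: Camera, u: float, v: float, depth: float) -> Vec3:
    """Back-project a patch center (u, v) with known depth to a world point."""
    xc = ((u - cam.cx) / cam.fx * depth, (v - cam.cy) / cam.fy * depth, depth)
    # X_world = R @ X_cam + t
    x, y, z = (sum(cam.R[i][j] * xc[j] for j in range(3)) + cam.t[i]
               for i in range(3))
    return (x, y, z)


def project_point(cam: Camera, p: Vec3) -> Optional[Tuple[float, float]]:
    """Project a world point into cam; None if behind it or outside the image."""
    d = [p[i] - cam.t[i] for i in range(3)]
    # X_cam = R^T @ (X_world - t)
    xc = [sum(cam.R[j][i] * d[j] for j in range(3)) for i in range(3)]
    if xc[2] <= 0.0:
        return None
    u = cam.fx * xc[0] / xc[2] + cam.cx
    v = cam.fy * xc[1] / xc[2] + cam.cy
    if 0.0 <= u < cam.width and 0.0 <= v < cam.height:
        return (u, v)
    return None


def retrieve_patches(memory: Dict[str, Vec3],
                     query: Camera) -> Dict[str, Tuple[float, float]]:
    """Return the stored patches visible in the query view, with pixel positions."""
    hits = {}
    for patch_id, point in memory.items():
        uv = project_point(query, point)
        if uv is not None:
            hits[patch_id] = uv
    return hits


# Store three patches seen from a source view, then query from a camera
# shifted one unit along the x axis.
source = Camera(fx=64.0, fy=64.0, cx=32.0, cy=32.0, width=64, height=64)
memory = {
    "a": lift_patch(source, 32.0, 32.0, 2.0),
    "b": lift_patch(source, 48.0, 32.0, 4.0),
    "c": lift_patch(source, 0.0, 32.0, 2.0),
}
query = Camera(64.0, 64.0, 32.0, 32.0, 64, 64, t=(1.0, 0.0, 0.0))
print(retrieve_patches(memory, query))
```

In this toy setup, patches `a` and `b` fall inside the query image and patch `c` does not. Under the abstract's description, the retrieved patches would be composed in the queried view as conditioning, while regions with no retrieved patch are left for the model to generate.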
