

MosaicMem: Hybrid Spatial Memory for Controllable Video World Models

March 17, 2026
作者: Wei Yu, Runjia Qian, Yumeng Li, Liquan Wang, Songheng Yin, Sri Siddarth Chakaravarthy P, Dennis Anthony, Yang Ye, Yidi Li, Weiwei Wan, Animesh Garg
cs.AI

Abstract

Video diffusion models are moving beyond short, plausible clips toward world simulators that must remain consistent under camera motion, revisits, and intervention. Yet spatial memory remains a key bottleneck: explicit 3D structures can improve reprojection-based consistency but struggle to depict moving objects, while implicit memory often produces inaccurate camera motion even with correct poses. We propose Mosaic Memory (MosaicMem), a hybrid spatial memory that lifts patches into 3D for reliable localization and targeted retrieval, while exploiting the model's native conditioning to preserve prompt-following generation. MosaicMem composes spatially aligned patches in the queried view via a patch-and-compose interface, preserving what should persist while allowing the model to inpaint what should evolve. With PRoPE camera conditioning and two new memory alignment methods, experiments show improved pose adherence compared to implicit memory and stronger dynamic modeling than explicit baselines. MosaicMem further enables minute-level navigation, memory-based scene editing, and autoregressive rollout.
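The abstract's core mechanism, lifting image patches into 3D for localization and then composing spatially aligned patches in a queried view, can be illustrated with a toy sketch. The helper names (`lift_patches`, `compose_query_view`), the pinhole-camera math, and the nearest-pixel splatting are hypothetical simplifications for intuition, not the paper's actual implementation; unfilled cells are left masked, standing in for the regions the model is allowed to inpaint.

```python
import numpy as np

def lift_patches(depth, K, cam_to_world, patch=16):
    """Lift patch centers into world-space 3D points (hypothetical helper).

    depth: (H, W) depth map; K: 3x3 intrinsics; cam_to_world: 4x4 pose.
    """
    H, W = depth.shape
    ys, xs = np.meshgrid(np.arange(patch // 2, H, patch),
                         np.arange(patch // 2, W, patch), indexing="ij")
    z = depth[ys, xs]
    # Back-project pixel centers through the pinhole model.
    pts_cam = np.stack([(xs - K[0, 2]) * z / K[0, 0],
                        (ys - K[1, 2]) * z / K[1, 1],
                        z], axis=-1).reshape(-1, 3)
    pts_h = np.concatenate([pts_cam, np.ones((len(pts_cam), 1))], axis=1)
    return (cam_to_world @ pts_h.T).T[:, :3]

def compose_query_view(points, feats, K, world_to_cam, shape):
    """Splat memory patches into the queried view with a z-buffer.

    Cells left unfilled in `mask` mark regions the generator would inpaint.
    """
    H, W = shape
    canvas = np.zeros((H, W, feats.shape[1]))
    mask = np.zeros((H, W), dtype=bool)
    zbuf = np.full((H, W), np.inf)
    pts_h = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    cam = (world_to_cam @ pts_h.T).T[:, :3]
    for p, f in zip(cam, feats):
        if p[2] <= 0:          # behind the query camera
            continue
        u = int(round(p[0] / p[2] * K[0, 0] + K[0, 2]))
        v = int(round(p[1] / p[2] * K[1, 1] + K[1, 2]))
        if 0 <= u < W and 0 <= v < H and p[2] < zbuf[v, u]:
            zbuf[v, u] = p[2]  # keep the nearest patch per pixel
            canvas[v, u] = f
            mask[v, u] = True
    return canvas, mask
```

In this toy version a patch is reduced to a single feature vector at its center; the real system would carry full patch content and handle occlusion and blending far more carefully.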