
**MosaicMem: Hybrid Spatial Memory for Controllable Video World Models**

March 17, 2026
Authors: Wei Yu, Runjia Qian, Yumeng Li, Liquan Wang, Songheng Yin, Sri Siddarth Chakaravarthy P, Dennis Anthony, Yang Ye, Yidi Li, Weiwei Wan, Animesh Garg
cs.AI

Abstract

Video diffusion models are moving beyond short, plausible clips toward world simulators that must remain consistent under camera motion, revisits, and intervention. Yet spatial memory remains a key bottleneck: explicit 3D structures can improve reprojection-based consistency but struggle to depict moving objects, while implicit memory often produces inaccurate camera motion even with correct poses. We propose Mosaic Memory (MosaicMem), a hybrid spatial memory that lifts patches into 3D for reliable localization and targeted retrieval, while exploiting the model's native conditioning to preserve prompt-following generation. MosaicMem composes spatially aligned patches in the queried view via a patch-and-compose interface, preserving what should persist while allowing the model to inpaint what should evolve. With PRoPE camera conditioning and two new memory alignment methods, experiments show improved pose adherence compared to implicit memory and stronger dynamic modeling than explicit baselines. MosaicMem further enables minute-level navigation, memory-based scene editing, and autoregressive rollout.
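The core geometric operation described above, lifting patches into 3D for localization and then compositing spatially aligned patches in the queried view, can be illustrated with a minimal pinhole-camera sketch. This is not the authors' implementation; the function names, patch size, and pasting scheme are illustrative assumptions, and it shows only the reprojection and the persist-vs-inpaint mask, not the diffusion model's conditioning.

```python
import numpy as np

def unproject(pix, depth, K, cam_to_world):
    """Lift 2D pixel coordinates with depth into world-space 3D points.

    pix: (N, 2) pixel coords; depth: (N,); K: 3x3 intrinsics;
    cam_to_world: 4x4 camera pose. (Hypothetical helper, not MosaicMem's API.)
    """
    ones = np.ones((pix.shape[0], 1))
    rays = (np.linalg.inv(K) @ np.hstack([pix, ones]).T).T  # camera-space rays
    pts_cam = rays * depth[:, None]                         # scale rays by depth
    return (cam_to_world @ np.hstack([pts_cam, ones]).T).T[:, :3]

def project(pts_world, K, world_to_cam):
    """Project world-space points into a target view; returns pixels and depths."""
    ones = np.ones((pts_world.shape[0], 1))
    pts_cam = (world_to_cam @ np.hstack([pts_world, ones]).T).T[:, :3]
    z = pts_cam[:, 2]
    pix = (K @ pts_cam.T).T
    return pix[:, :2] / z[:, None], z

def compose_canvas(patches, pix, z, hw, patch=8):
    """Paste memory patches at their reprojected locations in the query view.

    Returns a conditioning canvas plus a mask: 1 where memory supplies content
    that should persist, 0 where the generator is free to inpaint what evolves.
    """
    H, W = hw
    canvas = np.zeros((H, W, 3), dtype=np.float32)
    mask = np.zeros((H, W), dtype=np.float32)
    for p, (u, v), depth in zip(patches, pix, z):
        if depth <= 0:  # patch falls behind the query camera
            continue
        u0, v0 = int(u) - patch // 2, int(v) - patch // 2
        if 0 <= u0 and u0 + patch <= W and 0 <= v0 and v0 + patch <= H:
            canvas[v0:v0 + patch, u0:u0 + patch] = p
            mask[v0:v0 + patch, u0:u0 + patch] = 1.0
    return canvas, mask
```

In this toy setup, revisiting a location means the reprojected patches land back on their original pixels, so the masked regions stay consistent across views while unmasked regions are left for the model to generate.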
PDF · March 20, 2026