DecMem: Towards Minute-Level Consistent World Generation with Decoupled Memory

Abstract

Recent advances in video generative models have driven rapid progress in controllable world models. However, maintaining fine-grained spatio-temporal consistency under long-horizon reasoning remains a key challenge. In this work, we move beyond explicit 3D memory and coarse frame-level implicit modeling, and propose a fine-grained, learnable, and scalable memory mechanism for consistent world generation. We first identify two fundamental limitations of naïve learnable memory architectures in long-horizon extrapolation: computational inefficiency and attention dispersion. Based on a systematic analysis of attention dispersion, we propose DecMem, a decoupled memory architecture that employs Sparse Global Memory for efficient, fine-grained access to global history and Anchored Local Memory for stable, high-quality extrapolation. Extensive experiments demonstrate that DecMem significantly outperforms current state-of-the-art methods. By providing precise and efficient long-term memory together with strong extrapolation capability, DecMem enables minute-level controllable long video generation with high fidelity and consistency.