DecMem: Towards Minute-Level Consistent World Generation with Decoupled Memory

Abstract

Recent advances in video generative models have driven rapid progress in controllable world models. However, maintaining fine-grained spatio-temporal consistency under long-horizon reasoning remains a key challenge. In this work, we move beyond explicit 3D memory and coarse frame-level implicit modeling, and propose a fine-grained, learnable, and scalable memory for consistent world generation. We first identify two fundamental limitations of naïve learnable memory architectures in long-horizon extrapolation: computational inefficiency and attention dispersion. Building on a systematic analysis of attention dispersion, we propose DecMem, a decoupled memory architecture that combines Sparse Global Memory, which provides efficient fine-grained access to global history, with Anchored Local Memory, which provides stable and high-quality extrapolation. Extensive experiments demonstrate that DecMem significantly outperforms current state-of-the-art methods. By ensuring precise and efficient long-term memory and achieving superior extrapolation capabilities, DecMem enables minute-level controllable long video generation with high fidelity and consistency.
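
As a rough illustration of what decoupling a sparse global read from an anchored local read could look like, the sketch below reads a token history through two separate paths. It is not the DecMem implementation: the abstract does not specify the selection rule or what "anchored" means, so the top-k dot-product selection, the use of the earliest tokens as anchors, and every name here (`select_global`, `select_local`, `decoupled_read`, `top_k`, `num_anchor`, `window`) are assumptions made for this example.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention of one query over a small key/value set."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def select_global(query, history, top_k):
    """Assumed sparse global read: keep the top_k history tokens most similar to the query."""
    ranked = sorted(range(len(history)),
                    key=lambda i: dot(query, history[i]), reverse=True)
    return sorted(ranked[:top_k])


def select_local(history, num_anchor, window):
    """Assumed anchored local read: the earliest tokens plus the most recent window."""
    n = len(history)
    anchors = range(min(num_anchor, n))
    recent = range(max(n - window, 0), n)
    return sorted(set(anchors) | set(recent))


def decoupled_read(query, history, top_k=2, num_anchor=1, window=2):
    """Read the history through two independent paths and return both results."""
    g_idx = select_global(query, history, top_k)
    l_idx = select_local(history, num_anchor, window)
    g_tokens = [history[i] for i in g_idx]
    l_tokens = [history[i] for i in l_idx]
    # Tokens serve as both keys and values to keep the sketch small.
    g_out = attend(query, g_tokens, g_tokens)
    l_out = attend(query, l_tokens, l_tokens)
    return g_idx, l_idx, g_out, l_out


history = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [0.5, 0.5], [2.0, 0.0]]
g_idx, l_idx, g_out, l_out = decoupled_read([1.0, 0.0], history)
print(g_idx, l_idx)
```

In this toy setup each softmax runs over at most `top_k` or `num_anchor + window` tokens regardless of how long the history grows, which is one simple way to keep attention from spreading thinly over a long context. The ranking step in `select_global` still scans the whole history, so the sketch does not address the computational cost the abstract refers to.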