DecMem: Minutes-Long Consistent World Generation with Decoupled Memory

Abstract

Recent advances in video generative models have promoted rapid progress in controllable world models. However, maintaining fine-grained spatio-temporal consistency under long-horizon reasoning remains a key challenge. In this work, we move beyond explicit 3D memory and coarse frame-level implicit modeling, and propose a fine-grained, learnable, and scalable memory for consistent world generation. We first identify two fundamental limitations of naïve learnable memory architectures in long-horizon extrapolation, namely computational inefficiency and attention dispersion. Through a systematic analysis of attention dispersion, we propose DecMem, a decoupled memory architecture that employs Sparse Global Memory for efficient fine-grained access to global history and Anchored Local Memory for stable and high-quality extrapolation. Extensive experiments demonstrate that DecMem significantly outperforms current state-of-the-art methods. By ensuring precise and efficient long-term memory and achieving superior extrapolation capabilities, DecMem enables minute-level controllable long video generation with high fidelity and consistency.
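The abstract names attention dispersion but does not define it or say how Sparse Global Memory selects entries. As a rough illustration only, the sketch below assumes one common reading: with dense softmax attention, the weight on a single relevant memory token shrinks as the memory grows, while restricting attention to a few top-scoring tokens keeps that weight concentrated. The function names, the score values, and the use of top-k selection are hypothetical choices for this toy example and are not taken from DecMem.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def relevant_mass(num_tokens, top_k=None, seed=0):
    """Attention weight placed on one relevant memory token.

    Toy setup (an assumption, not the paper's): token 0 is the relevant
    one with score 4.0; every other token is a distractor with a score
    drawn uniformly from [0, 1).
    """
    rng = random.Random(seed)
    scores = [4.0] + [rng.random() for _ in range(num_tokens - 1)]

    if top_k is None:
        # Dense attention: softmax over the whole memory.
        return softmax(scores)[0]

    # Sparse attention: softmax over the top_k highest-scoring tokens only.
    keep = sorted(range(num_tokens), key=lambda i: scores[i], reverse=True)[:top_k]
    weights = softmax([scores[i] for i in keep])
    return weights[keep.index(0)]


if __name__ == "__main__":
    for n in (10, 100, 1000, 10000):
        dense = relevant_mass(n)
        sparse = relevant_mass(n, top_k=8)
        print(f"memory={n:>5}  dense={dense:.4f}  top-8={sparse:.4f}")
```

In this toy setting the dense weight on the relevant token falls toward zero as the memory grows, while the top-8 weight stays roughly constant. This is only meant to make the stated motivation for a sparse global memory concrete, not to describe the actual mechanism.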