VideoSSM: Autoregressive Long Video Generation with Hybrid State-Space Memory

December 4, 2025
Authors: Yifei Yu, Xiaoshan Wu, Xinting Hu, Tao Hu, Yangtian Sun, Xiaoyang Lyu, Bo Wang, Lin Ma, Yuewen Ma, Zhongrui Wang, Xiaojuan Qi
cs.AI

Abstract

Autoregressive (AR) diffusion enables streaming, interactive long-video generation by producing frames causally, yet maintaining coherence over minute-scale horizons remains challenging due to accumulated errors, motion drift, and content repetition. We approach this problem from a memory perspective, treating video synthesis as a recurrent dynamical process that requires coordinated short- and long-term context. We propose VideoSSM, a Long Video Model that unifies AR diffusion with a hybrid state-space memory. The state-space model (SSM) serves as an evolving global memory of scene dynamics across the entire sequence, while a context window provides local memory for motion cues and fine details. This hybrid design preserves global consistency without frozen, repetitive patterns, supports prompt-adaptive interaction, and scales in linear time with sequence length. Experiments on short- and long-range benchmarks demonstrate state-of-the-art temporal consistency and motion stability among autoregressive video generators, especially at minute-scale horizons, enabling content diversity and interactive prompt-based control, thereby establishing a scalable, memory-aware framework for long video generation.
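
To make the hybrid-memory idea concrete, below is a minimal PyTorch sketch of one possible layer in this style. Everything in it is an illustrative assumption rather than the paper's architecture: the class name HybridMemoryBlock, the diagonal EMA-style SSM update, and the fixed-size attention window are placeholders, since the abstract does not specify VideoSSM's actual layer design.

```python
# Hypothetical sketch only: a diagonal SSM as global memory plus
# sliding-window attention as local memory, fused per frame. Names and
# parameterization are assumptions, not the paper's implementation.
import torch
import torch.nn as nn


class HybridMemoryBlock(nn.Module):
    def __init__(self, dim: int, window: int = 8):
        super().__init__()
        self.window = window
        # Per-channel decay for a diagonal, EMA-style state-space recurrence.
        self.log_decay = nn.Parameter(torch.zeros(dim))
        self.in_proj = nn.Linear(dim, dim)        # input -> state update
        self.qkv = nn.Linear(dim, 3 * dim)        # local windowed attention
        self.out_proj = nn.Linear(2 * dim, dim)   # fuse global + local memory

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, dim), one feature vector per generated frame.
        B, T, D = frames.shape
        q, k, v = self.qkv(frames).chunk(3, dim=-1)
        decay = torch.sigmoid(self.log_decay)     # in (0, 1): keeps state bounded
        state = frames.new_zeros(B, D)            # global memory h_0 = 0
        outputs = []
        for t in range(T):
            # Global memory: h_t = a * h_{t-1} + (1 - a) * f(x_t).
            # One O(1) update per frame, so the whole pass is linear in T.
            state = decay * state + (1.0 - decay) * self.in_proj(frames[:, t])
            # Local memory: causal attention over the last `window` frames only.
            lo = max(0, t - self.window + 1)
            att = torch.einsum("bd,bwd->bw", q[:, t], k[:, lo:t + 1]) / D ** 0.5
            local = torch.einsum("bw,bwd->bd", att.softmax(-1), v[:, lo:t + 1])
            outputs.append(self.out_proj(torch.cat([state, local], dim=-1)))
        return torch.stack(outputs, dim=1)        # (B, T, D)


if __name__ == "__main__":
    block = HybridMemoryBlock(dim=64, window=8)
    out = block(torch.randn(2, 32, 64))           # 2 clips, 32 frames each
    print(out.shape)                              # torch.Size([2, 32, 64])
```

The sketch mirrors the two claims in the abstract: the recurrent state gives constant memory and linear time in sequence length, while the bounded window supplies fresh motion cues and detail so the output need not collapse onto a frozen global summary.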