Fast Autoregressive Video Diffusion and World Models with Temporal Cache Compression and Sparse Attention
February 2, 2026
Authors: Dvir Samuel, Issar Tzachor, Matan Levy, Michael Green, Gal Chechik, Rami Ben-Ari
cs.AI
Abstract
Autoregressive video diffusion models enable streaming generation, opening the door to long-form synthesis, video world models, and interactive neural game engines. However, their core attention layers become a major bottleneck at inference time: as generation progresses, the KV cache grows, causing both increasing latency and escalating GPU memory usage, which in turn restricts the usable temporal context and harms long-range consistency. In this work, we study redundancy in autoregressive video diffusion and identify three persistent sources: near-duplicate cached keys across frames, slowly evolving (largely semantic) queries/keys that make many attention computations redundant, and cross-attention over long prompts where only a small subset of tokens matters per frame. Building on these observations, we propose a unified, training-free attention framework for autoregressive diffusion: TempCache compresses the KV cache via temporal correspondence to bound cache growth; AnnCA accelerates cross-attention by selecting frame-relevant prompt tokens using fast approximate nearest neighbor (ANN) matching; and AnnSA sparsifies self-attention by restricting each query to semantically matched keys, also via a lightweight ANN. Together, these modules reduce attention compute and memory footprint and are compatible with existing autoregressive diffusion backbones and world models. Experiments demonstrate up to 5-10x end-to-end speedups while preserving near-identical visual quality and, crucially, maintaining stable throughput and nearly constant peak GPU memory usage over long rollouts, where prior methods progressively slow down and suffer from growing memory usage.
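The core idea behind AnnSA — restricting each query to a small set of semantically matched keys — can be illustrated with a toy sketch. This is not the paper's implementation: it uses an exact per-query top-k as a stand-in for a real ANN index, and the function name and shapes are assumptions for illustration only.

```python
import numpy as np

def sparse_attention_topk(Q, K, V, k=4):
    """Toy ANN-style sparse attention: each query attends only to its k
    most similar keys (exact top-k stands in for an approximate index)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                          # (n_q, n_k)
    topk = np.argpartition(-scores, k - 1, axis=1)[:, :k]  # per-query key subset
    out = np.empty_like(Q)
    for i, idx in enumerate(topk):
        s = scores[i, idx]
        w = np.exp(s - s.max())
        w /= w.sum()                                       # softmax over selected keys only
        out[i] = w @ V[idx]
    return out

rng = np.random.default_rng(0)
Q = rng.standard_normal((8, 16))
K = rng.standard_normal((32, 16))
V = rng.standard_normal((32, 16))
O = sparse_attention_topk(Q, K, V, k=4)
print(O.shape)  # (8, 16)
```

With k equal to the number of keys this reduces to ordinary dense attention; the savings come from choosing k much smaller than the (growing) cache length, which is what keeps per-frame cost roughly constant over long rollouts.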