Fast Autoregressive Video Diffusion and World Models with Temporal Cache Compression and Sparse Attention
February 2, 2026
Authors: Dvir Samuel, Issar Tzachor, Matan Levy, Michael Green, Gal Chechik, Rami Ben-Ari
cs.AI
Abstract
Autoregressive video diffusion models enable streaming generation, opening the door to long-form synthesis, video world models, and interactive neural game engines. However, their core attention layers become a major bottleneck at inference time: as generation progresses, the KV cache grows, causing both increasing latency and escalating GPU memory usage, which in turn restricts the usable temporal context and harms long-range consistency. In this work, we study redundancy in autoregressive video diffusion and identify three persistent sources: near-duplicate cached keys across frames, slowly evolving (largely semantic) queries/keys that make many attention computations redundant, and cross-attention over long prompts where only a small subset of tokens matters per frame. Building on these observations, we propose a unified, training-free attention framework for autoregressive diffusion: TempCache compresses the KV cache via temporal correspondence to bound cache growth; AnnCA accelerates cross-attention by selecting frame-relevant prompt tokens using fast approximate nearest neighbor (ANN) matching; and AnnSA sparsifies self-attention by restricting each query to semantically matched keys, also via a lightweight ANN. Together, these modules reduce attention compute and memory footprint and are compatible with existing autoregressive diffusion backbones and world models. Experiments demonstrate up to 5-10x end-to-end speedups while preserving near-identical visual quality and, crucially, maintaining stable throughput and nearly constant peak GPU memory usage over long rollouts, where prior methods progressively slow down and suffer from growing memory usage.
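The core idea behind AnnSA — restricting each query to a small set of semantically matched keys — can be illustrated with a toy sketch. This is not the paper's implementation: it uses an exact per-query top-k as a stand-in for a real ANN index, and the function name and shapes are assumptions for illustration only.

```python
import numpy as np

def sparse_attention_topk(Q, K, V, k=4):
    """Toy ANN-style sparse attention: each query attends only to its k
    most similar keys (exact top-k stands in for an approximate index)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                          # (n_q, n_k)
    topk = np.argpartition(-scores, k - 1, axis=1)[:, :k]  # per-query key subset
    out = np.empty_like(Q)
    for i, idx in enumerate(topk):
        s = scores[i, idx]
        w = np.exp(s - s.max())
        w /= w.sum()                                       # softmax over selected keys only
        out[i] = w @ V[idx]
    return out

rng = np.random.default_rng(0)
Q = rng.standard_normal((8, 16))
K = rng.standard_normal((32, 16))
V = rng.standard_normal((32, 16))
O = sparse_attention_topk(Q, K, V, k=4)
print(O.shape)  # (8, 16)
```

With k equal to the number of keys this reduces to ordinary dense attention; the savings come from choosing k much smaller than the (growing) cache length, which is what keeps per-frame cost roughly constant over long rollouts.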