VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion

Abstract

Long-rollout causal video diffusion has converged on a fixed-size sliding-window KV cache, with recent progress innovating within this layout by changing which tokens occupy the window or how their positions are encoded. The per-head KV layout itself, a dominant contributor to streaming memory and latency, has been mostly left unchanged. In this paper, we present the first study of Multi-Head Latent Attention (MLA) in video diffusion. VideoMLA replaces per-head keys and values with a shared low-rank content latent and a shared decoupled 3D-RoPE positional key, reducing per-token KV memory by 92.7% at every cached layer. We further investigate why MLA succeeds in video diffusion even though the spectral assumption often used to motivate it in language models does not hold: pretrained video attention is not low-rank, with 99%-energy effective rank far above any practical latent dimension. VideoMLA retains quality at compression ratios where direct spectral approximation would predict large reconstruction error. We show that the MLA bottleneck, rather than the pretrained spectrum, determines the effective rank: both spectral and random initialization occupy nearly the full rank budget from initialization, and training preserves this budget while adapting within it. On VBench, VideoMLA matches short-horizon streaming video diffusion baselines, achieves the best overall score at long horizons among evaluated methods, and improves throughput by 1.23x on a single B200.
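
The two quantities the abstract relies on can be made concrete with a minimal sketch. It assumes the common definition of 99%-energy effective rank (the smallest k whose top-k squared singular values hold 99% of the total) and a simple per-token cache count in which the per-head layout stores keys and values for every head while the latent layout stores one shared content latent plus one shared positional key. The abstract does not state the paper's exact definitions or dimensions, so the function names and the sizes in the usage note are hypothetical.

```python
import numpy as np


def energy_effective_rank(matrix, energy=0.99):
    """Smallest k whose top-k squared singular values hold `energy` of the total.

    Assumed definition; the paper's exact formulation may differ.
    """
    singular_values = np.linalg.svd(matrix, compute_uv=False)
    cumulative = np.cumsum(singular_values**2) / np.sum(singular_values**2)
    # searchsorted returns the first index whose cumulative energy >= threshold.
    k = int(np.searchsorted(cumulative, energy)) + 1
    return min(k, len(singular_values))


def kv_memory_reduction(num_heads, head_dim, latent_dim, rope_dim):
    """Fractional per-token cache saving of a shared latent layout.

    Per-head layout: keys and values for every head -> 2 * num_heads * head_dim.
    Latent layout: one content latent plus one positional key -> latent_dim + rope_dim.
    All four arguments are hypothetical parameters for illustration.
    """
    per_head_entries = 2 * num_heads * head_dim
    latent_entries = latent_dim + rope_dim
    return 1.0 - latent_entries / per_head_entries
```

For example, `kv_memory_reduction(12, 128, 192, 32)` evaluates to roughly 0.927. Those sizes are illustrative values chosen to land near the reported 92.7%, not a configuration taken from the paper. Likewise, `energy_effective_rank` applied to a full-rank matrix with a flat spectrum returns a value near the matrix dimension, which is the sense in which an attention projection "is not low-rank".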