VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion

Abstract

Long-rollout causal video diffusion has converged on a fixed-size sliding-window KV cache, and recent progress innovates within this layout by changing which tokens occupy the window or how their positions are encoded. The per-head KV layout itself, a dominant contributor to streaming memory and latency, has mostly been left unchanged. In this paper, we present the first study of Multi-Head Latent Attention (MLA) in video diffusion. VideoMLA replaces per-head keys and values with a shared low-rank content latent and a shared decoupled 3D-RoPE positional key, reducing per-token KV memory by 92.7% at every cached layer. We further investigate why MLA succeeds in video diffusion even though the spectral assumption often used to motivate it in language models does not hold: pretrained video attention is not low-rank, and its 99%-energy effective rank is far above any practical latent dimension. VideoMLA nevertheless retains quality at compression ratios where direct spectral approximation would predict large reconstruction error. We show that the MLA bottleneck, rather than the pretrained spectrum, determines the effective rank: under both spectral and random initialization, the model occupies nearly the full rank budget from the start, and training preserves this budget while adapting within it. On VBench, VideoMLA matches short-horizon streaming video diffusion baselines, achieves the best overall score at long horizons among evaluated methods, and improves throughput by 1.23x on a single B200.
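The two quantities the abstract relies on, per-token KV memory and 99%-energy effective rank, can be made concrete with a small sketch. It rests on assumptions that are not taken from the paper. A standard multi-head cache is assumed to store one key and one value vector per head. The MLA cache is assumed to store a single content latent plus a single positional key per token. Effective rank is assumed to follow the common definition: the smallest k whose top-k squared singular values reach the energy threshold. The function names and the dimensions in the usage lines are hypothetical and are not meant to reproduce the reported 92.7%.

```python
from typing import Iterable


def kv_elements_per_token_mha(num_heads: int, head_dim: int) -> int:
    """Per-head keys and values: 2 * H * d_h cached elements per token per layer."""
    return 2 * num_heads * head_dim


def kv_elements_per_token_mla(latent_dim: int, rope_dim: int) -> int:
    """Assumed MLA layout: one shared content latent plus one shared positional key."""
    return latent_dim + rope_dim


def kv_reduction(num_heads: int, head_dim: int, latent_dim: int, rope_dim: int) -> float:
    """Fraction of per-token KV elements saved by the assumed MLA layout."""
    mha = kv_elements_per_token_mha(num_heads, head_dim)
    mla = kv_elements_per_token_mla(latent_dim, rope_dim)
    return 1.0 - mla / mha


def energy_effective_rank(singular_values: Iterable[float], energy: float = 0.99) -> int:
    """Smallest k such that the top-k squared singular values hold `energy` of the total."""
    squared = sorted((s * s for s in singular_values), reverse=True)
    total = sum(squared)
    if total == 0:
        return 0
    running = 0.0
    for k, value in enumerate(squared, start=1):
        running += value
        if running >= energy * total:
            return k
    return len(squared)


if __name__ == "__main__":
    # Hypothetical dimensions chosen only for illustration.
    print(kv_reduction(num_heads=16, head_dim=128, latent_dim=256, rope_dim=64))
    # A spectrum dominated by one direction still needs three directions for 99% energy.
    print(energy_effective_rank([10.0, 1.0, 1.0, 1.0]))
```

Under this definition, a spectrum with many comparable singular values has an effective rank close to the full dimension, which is the situation the abstract describes for pretrained video attention.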