VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion

Abstract

Long-rollout causal video diffusion has converged on a fixed-size sliding-window KV cache, with recent progress innovating within this layout by changing which tokens occupy the window or how their positions are encoded. The per-head KV layout itself, a dominant contributor to streaming memory and latency, has been mostly left unchanged. In this paper, we present the first study of Multi-Head Latent Attention (MLA) in video diffusion. VideoMLA replaces per-head keys and values with a shared low-rank content latent and a shared decoupled 3D-RoPE positional key, reducing per-token KV memory by 92.7% at every cached layer. We further investigate why MLA succeeds in video diffusion even though the spectral assumption often used to motivate it in language models does not hold: pretrained video attention is not low-rank, with 99%-energy effective rank far above any practical latent dimension. VideoMLA retains quality at compression ratios where direct spectral approximation would predict large reconstruction error. We show that the MLA bottleneck, rather than the pretrained spectrum, determines the effective rank: both spectral and random initialization occupy nearly the full rank budget from initialization, and training preserves this budget while adapting within it. On VBench, VideoMLA matches short-horizon streaming video diffusion baselines, achieves the best overall score at long horizons among evaluated methods, and improves throughput by 1.23x on a single B200.
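Two quantities in the abstract can be made concrete with a short sketch: the per-token KV memory saved by caching one shared content latent plus one shared positional key instead of per-head keys and values, and the 99%-energy effective rank of a weight matrix. The function names, the element-count accounting (same dtype for both layouts), and every dimension below are illustrative assumptions, not the paper's configuration or code.

```python
import numpy as np


def kv_reduction(num_heads: int, head_dim: int, latent_dim: int, rope_dim: int) -> float:
    """Fraction of per-token KV elements saved at one cached layer.

    Assumed accounting: standard attention caches a key and a value for every
    head; the latent layout caches one shared content latent and one shared
    positional key.
    """
    per_head_elems = 2 * num_heads * head_dim
    latent_elems = latent_dim + rope_dim
    return 1.0 - latent_elems / per_head_elems


def energy_effective_rank(weight: np.ndarray, energy: float = 0.99) -> int:
    """Smallest k whose top-k squared singular values hold `energy` of the total."""
    s = np.linalg.svd(weight, compute_uv=False)
    cumulative = np.cumsum(s**2) / np.sum(s**2)
    # searchsorted returns the first index whose cumulative energy reaches the
    # threshold; add 1 to turn a 0-based index into a count of singular values.
    return int(np.searchsorted(cumulative, energy) + 1)


# Hypothetical dimensions, chosen only for illustration.
saving = kv_reduction(num_heads=16, head_dim=128, latent_dim=256, rope_dim=64)

rng = np.random.default_rng(0)
# A matrix built from 8 rank-1 components versus a dense Gaussian matrix.
low_rank = rng.standard_normal((64, 8)) @ rng.standard_normal((8, 64))
dense = rng.standard_normal((64, 64))
rank_low = energy_effective_rank(low_rank)
rank_dense = energy_effective_rank(dense)
```

With these hypothetical dimensions the saving is 1 - 320/4096, or about 92.2%. The low-rank matrix needs at most 8 singular values to reach 99% energy, while the dense Gaussian matrix needs far more, which mirrors the contrast the abstract draws between a small latent dimension and a high effective rank in pretrained attention.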