Forcing-KV: Hybrid KV Cache Compression for Efficient Autoregressive Video Diffusion Models

Abstract

Autoregressive (AR) video diffusion models adopt a streaming generation framework, enabling long-horizon video generation with real-time responsiveness, as exemplified by the Self Forcing training paradigm. However, existing AR video diffusion models still suffer from significant attention complexity and severe memory overhead due to redundant key-value (KV) caches across historical frames, which limits scalability. In this paper, we tackle this challenge by introducing KV cache compression into autoregressive video diffusion. We observe that attention heads in mainstream AR diffusion models exhibit markedly distinct attention patterns and functional roles that remain stable across samples and denoising steps. Building on our empirical study of head-wise functional specialization, we divide the attention heads into two categories: static heads, which focus on transitions across autoregressive chunks and intra-frame fidelity, and dynamic heads, which govern inter-frame motion and consistency. We then propose Forcing-KV, a hybrid KV cache compression strategy that performs structured static pruning for static heads and dynamic pruning based on segment-wise similarity for dynamic heads. While maintaining output quality, our method achieves a generation speed of over 29 frames per second on a single NVIDIA H200 GPU along with a 30% reduction in cache memory, delivering up to 1.35x and 1.50x speedups on LongLive and Self Forcing, respectively, at 480P resolution, and scaling further to a 2.82x speedup at 1080P resolution. Code and demo videos are provided at https://zju-jiyicheng.github.io/Forcing-KV-Page.
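
To make the hybrid strategy concrete, the following is a minimal illustrative sketch of per-head KV cache pruning. It is not the authors' implementation. The abstract does not specify the pruning pattern, the similarity measure, or how heads are classified, so everything below is an assumption: static heads are assumed to keep a fixed pattern of the earliest and most recent cached tokens, dynamic heads are assumed to drop a segment when its mean key is nearly identical (by cosine similarity) to the last kept segment, and the head labels are assumed to be given. All function names and parameters (`static_keep`, `dynamic_keep`, `compress_cache`, `sink`, `recent`, `segment_len`, `threshold`) are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def static_keep(num_tokens, sink, recent):
    """Assumed structured pattern: keep the first `sink` and last `recent` tokens."""
    keep = set(range(min(sink, num_tokens)))
    keep.update(range(max(0, num_tokens - recent), num_tokens))
    return sorted(keep)


def dynamic_keep(keys, segment_len, threshold):
    """Assumed similarity rule: keep a segment only if its mean key has cosine
    similarity below `threshold` with the mean key of the last kept segment."""
    kept = []
    last_mean = None
    for start in range(0, len(keys), segment_len):
        seg = keys[start:start + segment_len]
        mean = [sum(col) / len(seg) for col in zip(*seg)]
        if last_mean is None or cosine(mean, last_mean) < threshold:
            kept.extend(range(start, start + len(seg)))
            last_mean = mean
    return kept


def compress_cache(cache, head_types, sink=1, recent=2, segment_len=2, threshold=0.99):
    """Prune each head's cache according to its (pre-assigned) head type.

    cache: {head_id: (keys, values)}, each a list of per-token vectors.
    head_types: {head_id: "static" or "dynamic"} (assumed to be known in advance).
    """
    out = {}
    for head, (keys, values) in cache.items():
        if head_types[head] == "static":
            idx = static_keep(len(keys), sink, recent)
        else:
            idx = dynamic_keep(keys, segment_len, threshold)
        out[head] = ([keys[i] for i in idx], [values[i] for i in idx])
    return out
```

In this sketch the static rule depends only on token positions, so its index set can be computed once and reused, while the dynamic rule inspects the cached keys and therefore adapts to the content. A real system would operate on batched tensors and on frame- or chunk-level segments rather than Python lists.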