Forcing-KV: Hybrid KV Cache Compression for Efficient Autoregressive Video Diffusion Models

May 10, 2026
Authors: Yicheng Ji, Zhizhou Zhong, Jun Zhang, Qin Yang, XiTai Jin, Ying Qin, Wenhan Luo, Shuiyang Mao, Wei Liu, Huan Li
cs.AI

Abstract

Autoregressive (AR) video diffusion models adopt a streaming generation framework, enabling long-horizon video generation with real-time responsiveness, as exemplified by the Self Forcing training paradigm. However, existing AR video diffusion models still suffer from significant attention complexity and severe memory overhead due to redundant key-value (KV) caches across historical frames, which limits scalability. In this paper, we tackle this challenge by introducing KV cache compression into autoregressive video diffusion. We observe that attention heads in mainstream AR diffusion models exhibit markedly distinct attention patterns and functional roles that remain stable across samples and denoising steps. Building on our empirical study of head-wise functional specialization, we divide the attention heads into two categories: static heads, which focus on transitions across autoregressive chunks and intra-frame fidelity, and dynamic heads, which govern inter-frame motion and consistency. We then propose Forcing-KV, a hybrid KV cache compression strategy that performs structured static pruning for static heads and dynamic pruning based on segment-wise similarity for dynamic heads. While maintaining output quality, our method achieves a generation speed of over 29 frames per second on a single NVIDIA H200 GPU along with a 30% reduction in cache memory, delivering up to 1.35x and 1.50x speedups on LongLive and Self Forcing, respectively, at 480P resolution, and further scaling to a 2.82x speedup at 1080P resolution. Code and demo videos are provided at https://zju-jiyicheng.github.io/Forcing-KV-Page.
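
To make the hybrid idea in the abstract concrete, the following is a minimal toy sketch, not the authors' implementation. The abstract does not specify the pruning rules, so everything here is an assumption for illustration: the function names (`static_keep`, `dynamic_keep`, `compress_cache`), the "keep first and most recent segments" pattern for static heads, the mean-key cosine similarity test for dynamic heads, and the `sink`, `recent`, and `threshold` parameters. Head labels are assumed to be given in advance.

```python
import math


def mean_vector(segment):
    """Average the key vectors of one cached segment."""
    dim = len(segment[0])
    return [sum(vec[i] for vec in segment) / len(segment) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def static_keep(num_segments, sink=1, recent=2):
    """Assumed static pattern: keep the first `sink` and last `recent` segments."""
    keep = set(range(min(sink, num_segments)))
    keep.update(range(max(0, num_segments - recent), num_segments))
    return sorted(keep)


def dynamic_keep(segments, threshold=0.95):
    """Assumed dynamic rule: drop a segment whose mean key is nearly
    identical (cosine >= threshold) to that of the last kept segment."""
    if not segments:
        return []
    kept = [0]
    for idx in range(1, len(segments)):
        sim = cosine(mean_vector(segments[kept[-1]]), mean_vector(segments[idx]))
        if sim < threshold:
            kept.append(idx)
    return kept


def compress_cache(cache, head_types):
    """Return the indices of cached segments to keep for each head.

    cache: {head_id: [segment, ...]}, each segment a list of key vectors.
    head_types: {head_id: "static" | "dynamic"} (hypothetical labels).
    """
    kept = {}
    for head, segments in cache.items():
        if head_types[head] == "static":
            kept[head] = static_keep(len(segments))
        else:
            kept[head] = dynamic_keep(segments)
    return kept


# Toy cache: five segments per head, two 2-D key vectors per segment.
still = [[1.0, 0.0], [1.0, 0.0]]
moved = [[0.0, 1.0], [0.0, 1.0]]
cache = {
    0: [still, still, moved, moved, still],
    1: [still, still, moved, moved, still],
}
kept = compress_cache(cache, {0: "static", 1: "dynamic"})
print(kept)
```

In this toy setup the static head keeps a fixed, content-independent set of segments, while the dynamic head keeps only segments whose content changed relative to the last one it retained. A real system would operate on attention key/value tensors per layer and head, and would choose the segment granularity and similarity measure as described in the paper itself.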