Forcing-KV: Hybrid KV Cache Compression for Efficient Autoregressive Video Diffusion Models

Abstract

Autoregressive (AR) video diffusion models adopt a streaming generation framework that enables long-horizon video generation with real-time responsiveness, as exemplified by the Self Forcing training paradigm. However, existing AR video diffusion models still suffer from high attention complexity and severe memory overhead because the key-value (KV) caches kept for previously generated frames are redundant, which limits scalability. In this paper, we address this problem by introducing KV cache compression into autoregressive video diffusion. We observe that attention heads in mainstream AR diffusion models exhibit markedly distinct attention patterns and functional roles, and that these remain stable across samples and denoising steps. Building on an empirical study of this head-wise functional specialization, we divide attention heads into two categories: static heads, which focus on transitions across autoregressive chunks and on intra-frame fidelity, and dynamic heads, which govern inter-frame motion and consistency. We then propose Forcing-KV, a hybrid KV cache compression strategy that applies structured static pruning to static heads and dynamic pruning based on segment-wise similarity to dynamic heads. While maintaining output quality, our method achieves a generation speed of over 29 frames per second on a single NVIDIA H200 GPU together with a 30% reduction in cache memory. It delivers speedups of up to 1.35x on LongLive and 1.50x on Self Forcing at 480P resolution, and scales to a 2.82x speedup at 1080P resolution. Code and demo videos are available at https://zju-jiyicheng.github.io/Forcing-KV-Page.
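The abstract does not specify the pruning rules, so the following is only an illustrative sketch of what a hybrid, head-wise policy could look like, not the Forcing-KV implementation. It assumes that a static head keeps a fixed pattern (the first cached segment plus a recent window) and that a dynamic head keeps the segments whose mean key is most similar, by cosine similarity, to a query vector from the current chunk. All function names, the `sink`, `window`, and `budget` parameters, and the similarity criterion itself are hypothetical.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def prune_static(num_segments, sink=1, window=2):
    """Assumed static rule: keep the first `sink` and the last `window` segments."""
    keep = set(range(min(sink, num_segments)))
    keep.update(range(max(0, num_segments - window), num_segments))
    return sorted(keep)

def prune_dynamic(segment_keys, query, budget):
    """Assumed dynamic rule: keep the `budget` segments whose mean key
    has the highest cosine similarity to `query`."""
    scores = [cosine(mean_vector(seg), query) for seg in segment_keys]
    ranked = sorted(range(len(segment_keys)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])

def compress_cache(cache, head_is_static, query_per_head, budget):
    """cache[h] is a list of segments; each segment is a list of key vectors.
    Returns, for every head, the indices of the segments to keep."""
    kept = {}
    for h, segments in cache.items():
        if head_is_static[h]:
            kept[h] = prune_static(len(segments))
        else:
            kept[h] = prune_dynamic(segments, query_per_head[h], budget)
    return kept
```

In this sketch the head labels in `head_is_static` are supplied by the caller, reflecting the abstract's observation that head roles are stable across samples and denoising steps and could therefore be determined once in advance. The matching value vectors would be dropped using the same kept indices.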