Fast Autoregressive Video Diffusion and World Models with Temporal Cache Compression and Sparse Attention

Abstract

Autoregressive video diffusion models enable streaming generation, opening the door to long-form video synthesis, video world models, and interactive neural game engines. However, their core attention layers become a major bottleneck at inference time: as generation progresses, the KV cache grows, which increases latency and GPU memory usage, and this in turn restricts the usable temporal context and harms long-range consistency. In this work, we study redundancy in autoregressive video diffusion and identify three persistent sources: near-duplicate cached keys across frames; slowly evolving (largely semantic) queries and keys that make many attention computations redundant; and cross-attention over long prompts, where only a small subset of tokens matters for any given frame. Building on these observations, we propose a unified, training-free attention framework for autoregressive diffusion with three modules. TempCache compresses the KV cache via temporal correspondence to bound cache growth. AnnCA accelerates cross-attention by selecting frame-relevant prompt tokens with fast approximate nearest neighbor (ANN) matching. AnnSA sparsifies self-attention by restricting each query to semantically matched keys, also using a lightweight ANN. Together, these modules reduce attention cost, compute, and memory usage, and they are compatible with existing autoregressive diffusion backbones and world models. Experiments demonstrate end-to-end speedups of up to 5x to 10x while preserving near-identical visual quality and, crucially, maintaining stable throughput and nearly constant peak GPU memory usage over long rollouts, where prior methods progressively slow down and consume more memory.
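To make the two underlying ideas concrete, the following is a toy sketch in plain Python and is not the authors' implementation. It assumes a brute-force cosine-similarity test as a stand-in for temporal-correspondence cache compression, and an exact top-k selection as a stand-in for ANN matching. The function names (`append_deduplicated`, `topk_attention`), the `threshold` parameter, and the example vectors are all hypothetical and chosen only for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def append_deduplicated(cache_keys, cache_values, new_keys, new_values, threshold=0.99):
    """Append new (key, value) pairs, skipping keys that nearly duplicate a cached key.

    Assumption: a brute-force cosine test replaces whatever
    temporal-correspondence matching the actual method uses.
    """
    for k, v in zip(new_keys, new_values):
        if any(cosine(k, c) >= threshold for c in cache_keys):
            continue  # near-duplicate of a cached entry: the cache does not grow
        cache_keys.append(k)
        cache_values.append(v)
    return cache_keys, cache_values


def topk_attention(query, keys, values, k):
    """Softmax attention for one query over its k highest-scoring keys (k >= 1).

    Assumption: exact top-k is used for clarity; an ANN index would
    approximate this selection without scoring every key.
    """
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * x for q, x in zip(query, key)) for key in keys]
    keep = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    top = max(scores[i] for i in keep)  # subtract the max for numerical stability
    weights = [math.exp(scores[i] - top) for i in keep]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, keep)) / total
            for d in range(dim)]


# Two "frames" of keys; the first key of frame 2 nearly repeats a frame-1 key.
keys, values = [], []
append_deduplicated(keys, values, [[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
append_deduplicated(keys, values, [[1.0, 0.001], [-1.0, 0.0]], [[2.0, 2.0], [3.0, 3.0]])
print(len(keys))  # 3 entries are cached instead of 4
print(topk_attention([1.0, 0.0], keys, values, k=1))  # attends only to the best-matching key
```

In this sketch the cache grows only when a frame contributes a genuinely new key, and each query pays for k keys rather than the full cache. A real system would replace both brute-force scans with an ANN index, which is where the speedup would come from.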
