VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion

Abstract

Long-rollout causal video diffusion has converged on a fixed-size sliding-window KV cache, with recent progress innovating within this layout by changing which tokens occupy the window or how their positions are encoded. The per-head KV layout itself, a dominant contributor to streaming memory and latency, has been mostly left unchanged. In this paper, we present the first study of Multi-Head Latent Attention (MLA) in video diffusion. VideoMLA replaces per-head keys and values with a shared low-rank content latent and a shared decoupled 3D-RoPE positional key, reducing per-token KV memory by 92.7% at every cached layer. We further investigate why MLA succeeds in video diffusion even though the spectral assumption often used to motivate it in language models does not hold: pretrained video attention is not low-rank, with 99%-energy effective rank far above any practical latent dimension. VideoMLA retains quality at compression ratios where direct spectral approximation would predict large reconstruction error. We show that the MLA bottleneck, rather than the pretrained spectrum, determines the effective rank: both spectral and random initialization occupy nearly the full rank budget from initialization, and training preserves this budget while adapting within it. On VBench, VideoMLA matches short-horizon streaming video diffusion baselines, achieves the best overall score at long horizons among evaluated methods, and improves throughput by 1.23x on a single B200.
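The two quantities the abstract leans on, the per-token KV saving and the 99%-energy effective rank, can be illustrated with a minimal sketch. It rests on assumptions not stated in the abstract: the baseline is taken to cache one key and one value per head, the compressed layout is taken to cache one content latent plus one positional key per token, and "energy" is taken to mean squared singular values. The function names and every dimension below are hypothetical and chosen only for illustration; they are not the paper's configuration, even though this particular choice happens to land near the reported saving.

```python
from itertools import accumulate


def kv_memory_reduction(num_heads, head_dim, latent_dim, rope_dim):
    """Fractional per-token KV cache saving of a shared-latent layout.

    Assumed baseline: a key and a value per head (2 * num_heads * head_dim).
    Assumed compressed layout: one shared content latent plus one shared
    positional key (latent_dim + rope_dim).
    """
    baseline = 2 * num_heads * head_dim
    compressed = latent_dim + rope_dim
    return 1.0 - compressed / baseline


def energy_effective_rank(singular_values, energy=0.99):
    """Smallest k whose top-k squared singular values reach `energy` of the total."""
    squared = sorted((s * s for s in singular_values), reverse=True)
    total = sum(squared)
    if total == 0:
        return 0
    for k, cumulative in enumerate(accumulate(squared), start=1):
        if cumulative >= energy * total:
            return k
    return len(squared)


# Hypothetical dimensions, NOT taken from the paper.
saving = kv_memory_reduction(num_heads=12, head_dim=128, latent_dim=160, rope_dim=64)
print(f"per-token KV saving: {saving:.1%}")

# Two synthetic spectra: a flat one (energy spread over all directions)
# and a geometrically decaying one (energy concentrated in a few directions).
flat = [1.0] * 100
decaying = [0.5 ** i for i in range(100)]
print(energy_effective_rank(flat), energy_effective_rank(decaying))
```

A flat spectrum needs almost all of its directions to reach 99% of the energy, while a rapidly decaying one needs only a handful. The abstract's claim is that pretrained video attention sits closer to the first case, which is why a direct spectral truncation to a small latent dimension would be expected to lose substantial energy.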