FadeMem: Distance-Aware Memory Consolidation for Autoregressive Video Diffusion

Abstract

Autoregressive video generators synthesize long videos by generating successive temporal segments, but their historical KV cache grows with video length. Existing bounded-cache methods reduce this cost with local windows, sink tokens, or compressed memory states, yet they usually assign fixed roles to different parts of the history. We propose FadeMem, a distance-aware KV memory consolidation mechanism that organizes historical KV blocks into a temporal hierarchy under a fixed cache budget. This design is motivated by frequency-dependent temporal decay: fine details decorrelate quickly, while coarse scene structure and identity remain useful over longer horizons. During generation, new history is inserted as fine-grained entries, while older adjacent entries are progressively merged under a power-law temporal allocation schedule, yielding a dense-near, sparse-far memory within one cache. Without architectural changes, FadeMem preserves recent context for short-term dynamics and compact long-range anchors for identity and scene coherence. Experiments show improved subject consistency, background stability, and temporal coherence over existing bounded-cache strategies.
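The insert-then-merge behavior described in the abstract can be illustrated with a toy sketch. This is not the FadeMem implementation: the class name `FadingCache`, the pair-scoring rule, the `alpha` exponent, and the span-weighted averaging used to merge entries are all assumptions made for illustration, and each array stands in for a whole KV block. The sketch only shows how a fixed budget and a power-law allowance on span versus distance can produce a dense-near, sparse-far layout.

```python
import numpy as np


class FadingCache:
    """Toy bounded cache: fine entries for recent frames, merged entries for old ones."""

    def __init__(self, budget, alpha=1.0):
        self.budget = budget   # maximum number of entries kept (assumed >= 1)
        self.alpha = alpha     # exponent of the assumed power-law schedule
        self.blocks = []       # oldest first; one array per entry (stand-in for a KV block)
        self.spans = []        # number of original frames each entry covers

    def _pick_merge(self):
        # Score each adjacent pair by its merged span relative to the span
        # allowed at its distance from the newest frame; the lowest score merges.
        best_i, best_score = 0, float("inf")
        for i in range(len(self.spans) - 1):
            distance = sum(self.spans[i + 2:])      # frames newer than this pair
            allowed = (1 + distance) ** self.alpha  # hypothetical power-law allowance
            score = (self.spans[i] + self.spans[i + 1]) / allowed
            if score < best_score:                  # strict '<' keeps the older pair on ties
                best_i, best_score = i, score
        return best_i

    def insert(self, block):
        # New history always enters as a fine-grained entry covering one frame.
        self.blocks.append(np.asarray(block, dtype=np.float64))
        self.spans.append(1)
        # Merge adjacent entries until the cache fits the budget again.
        while len(self.blocks) > self.budget:
            i = self._pick_merge()
            a, b = self.spans[i], self.spans[i + 1]
            # Span-weighted mean as a placeholder merge operator (an assumption).
            merged = (a * self.blocks[i] + b * self.blocks[i + 1]) / (a + b)
            self.blocks[i:i + 2] = [merged]
            self.spans[i:i + 2] = [a + b]


cache = FadingCache(budget=4, alpha=1.0)
for t in range(8):
    cache.insert(np.full(2, float(t)))  # frame t is represented by a constant vector
print(cache.spans)  # [4, 2, 1, 1]
```

With a budget of four entries and eight inserted frames, the oldest entry summarizes four frames while the two newest entries remain at single-frame resolution. A larger `alpha` in this sketch lets distant entries grow faster, which shifts more of the budget toward recent frames.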