FadeMem: Distance-Aware Memory Consolidation for Autoregressive Video Diffusion

Abstract

Autoregressive video generators synthesize long videos by generating successive temporal segments, but their historical KV cache grows with video length. Existing bounded-cache methods reduce this cost with local windows, sink tokens, or compressed memory states, yet they usually assign fixed roles to different parts of the history. We propose FadeMem, a distance-aware KV memory consolidation mechanism that organizes historical KV blocks into a temporal hierarchy under a fixed cache budget. This design is motivated by frequency-dependent temporal decay: fine details decorrelate quickly, while coarse scene structure and identity remain useful over longer horizons. During generation, new history is inserted as fine-grained entries, while older adjacent entries are progressively merged under a power-law temporal allocation schedule, yielding a dense-near, sparse-far memory within one cache. Without architectural changes, FadeMem preserves recent context for short-term dynamics and compact long-range anchors for identity and scene coherence. Experiments show improved subject consistency, background stability, and temporal coherence over existing bounded-cache strategies.
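The consolidation loop described above can be illustrated with a minimal sketch. The abstract does not specify the merge operator or the exact form of the power-law schedule, so everything below is an assumption made for illustration rather than the authors' implementation: the names `Entry`, `FadingCache`, `budget`, and `alpha` are hypothetical, merging is a span-weighted average of keys and values, and the pair to merge is the adjacent pair with the smallest merged span relative to `distance ** alpha`, where `distance` is measured from the present.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """One cache slot summarizing time steps start..end (inclusive)."""
    start: int
    end: int
    k: list  # key vector (plain floats stand in for a KV block)
    v: list  # value vector

    @property
    def span(self) -> int:
        return self.end - self.start + 1


def _blend(x, wx, y, wy):
    # Weighted average of two equal-length vectors.
    return [(wx * xi + wy * yi) / (wx + wy) for xi, yi in zip(x, y)]


def merge(a: Entry, b: Entry) -> Entry:
    """Assumed merge rule: span-weighted average of two adjacent entries."""
    return Entry(
        a.start,
        b.end,
        _blend(a.k, a.span, b.k, b.span),
        _blend(a.v, a.span, b.v, b.span),
    )


class FadingCache:
    """Hypothetical fixed-budget cache: fine entries near the present, coarse far away."""

    def __init__(self, budget: int, alpha: float = 1.0):
        if budget < 2:
            raise ValueError("budget must be at least 2")
        self.budget = budget
        self.alpha = alpha  # assumed exponent of the power-law schedule
        self.entries = []   # ordered oldest -> newest

    def insert(self, t: int, k: list, v: list) -> None:
        # New history always enters as a fine-grained, single-step entry.
        self.entries.append(Entry(t, t, k, v))
        while len(self.entries) > self.budget:
            self._merge_once(now=t)

    def _merge_once(self, now: int) -> None:
        # Merge the adjacent pair whose merged span is smallest relative to
        # distance ** alpha, so entries may grow coarser the older they are.
        best_i, best_cost = 0, float("inf")
        for i in range(len(self.entries) - 1):
            a, b = self.entries[i], self.entries[i + 1]
            distance = now - b.end + 1  # 1 for the pair touching the newest step
            cost = (a.span + b.span) / distance ** self.alpha
            if cost < best_cost:
                best_i, best_cost = i, cost
        a, b = self.entries[best_i], self.entries[best_i + 1]
        self.entries[best_i:best_i + 2] = [merge(a, b)]

    def spans(self) -> list:
        return [e.span for e in self.entries]
```

Calling `insert` once per generated segment keeps the number of entries at or below `budget`, and `spans()` shows how many time steps each slot summarizes, oldest first. In this sketch a larger `alpha` lowers the merge cost of distant pairs relative to recent ones, so consolidation is pushed further toward old history. With a very small budget and a long sequence the most recent entries are eventually merged as well, which a real system would presumably guard against.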