FadeMem: Distance-Aware Memory Consolidation for Autoregressive Video Diffusion

Abstract

Autoregressive video generators synthesize long videos by generating successive temporal segments, but their historical KV cache grows with video length. Existing bounded-cache methods reduce this cost with local windows, sink tokens, or compressed memory states, yet they usually assign fixed roles to different parts of the history. We propose FadeMem, a distance-aware KV memory consolidation mechanism that organizes historical KV blocks into a temporal hierarchy under a fixed cache budget. This design is motivated by frequency-dependent temporal decay: fine details decorrelate quickly, while coarse scene structure and identity remain useful over longer horizons. During generation, new history is inserted as fine-grained entries, while older adjacent entries are progressively merged under a power-law temporal allocation schedule, yielding a dense-near, sparse-far memory within one cache. Without architectural changes, FadeMem preserves recent context for short-term dynamics and compact long-range anchors for identity and scene coherence. Experiments show improved subject consistency, background stability, and temporal coherence over existing bounded-cache strategies.
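
The consolidation idea described above can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the merge operator or the exact schedule, so the span-weighted averaging, the greedy cost `span / distance**alpha`, and the names `Entry`, `FadingCache`, `budget`, and `alpha` are all assumptions made for illustration. Plain lists of floats stand in for KV blocks.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    start: int   # first time step covered (inclusive)
    end: int     # end of the covered range (exclusive)
    value: list  # stand-in for a KV block (hypothetical representation)

    @property
    def span(self) -> int:
        return self.end - self.start


def merge(a: Entry, b: Entry) -> Entry:
    """Merge two adjacent entries by span-weighted averaging (assumed rule)."""
    total = a.span + b.span
    value = [(a.span * x + b.span * y) / total for x, y in zip(a.value, b.value)]
    return Entry(a.start, b.end, value)


class FadingCache:
    """Toy bounded cache: new entries are fine-grained, old ones get merged."""

    def __init__(self, budget: int, alpha: float = 1.0):
        self.budget = budget   # maximum number of entries kept
        self.alpha = alpha     # assumed power-law exponent
        self.entries = []      # ordered oldest first
        self.now = 0           # number of time steps inserted so far

    def insert(self, value) -> None:
        # New history always enters as a single-step entry.
        self.entries.append(Entry(self.now, self.now + 1, list(value)))
        self.now += 1
        while len(self.entries) > self.budget:
            self._merge_once()

    def _merge_once(self) -> None:
        # Pick the adjacent pair whose merged span is smallest relative to
        # distance**alpha, so distant entries are allowed to grow coarser.
        def cost(i: int) -> float:
            a, b = self.entries[i], self.entries[i + 1]
            distance = self.now - b.end + 1  # 1 for the most recent pair
            return (a.span + b.span) / distance ** self.alpha

        i = min(range(len(self.entries) - 1), key=cost)
        self.entries[i:i + 2] = [merge(self.entries[i], self.entries[i + 1])]


cache = FadingCache(budget=8)
for t in range(64):
    cache.insert([float(t)])
print([e.span for e in cache.entries])  # spans per entry, oldest first
```

Under these assumptions the cache never exceeds `budget` entries and always covers the full history contiguously; the printed spans are expected to be larger for older entries and smaller for recent ones. A real system would merge per-layer, per-head key and value tensors, and the paper's actual merge rule and schedule may differ from this sketch.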