FadeMem: Distance-Aware Memory Consolidation for Autoregressive Video Diffusion

Abstract

Autoregressive video generators synthesize long videos by generating successive temporal segments, but their historical key-value (KV) cache grows with video length. Existing bounded-cache methods reduce this cost with local windows, sink tokens, or compressed memory states, yet they usually assign fixed roles to different parts of the history. We propose FadeMem, a distance-aware KV memory consolidation mechanism that organizes historical KV blocks into a temporal hierarchy under a fixed cache budget. This design is motivated by frequency-dependent temporal decay: fine details decorrelate quickly, while coarse scene structure and subject identity remain useful over longer horizons. During generation, new history is inserted as fine-grained entries, while older adjacent entries are progressively merged under a power-law temporal allocation schedule, yielding a dense-near, sparse-far memory within a single cache. Without architectural changes, FadeMem preserves recent context for short-term dynamics and compact long-range anchors for identity and scene coherence. Experiments show improved subject consistency, background stability, and temporal coherence over existing bounded-cache strategies.
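
The consolidation loop described above can be illustrated with a toy sketch. The abstract does not specify the merge operator or the exact form of the power-law schedule, so everything here is an assumption made for illustration: the names `Entry`, `FadingCache`, and `alpha`, the span-weighted averaging used to merge two entries, and the scoring rule (merged span divided by distance raised to `alpha`). Short float lists stand in for real KV blocks.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Entry:
    start: int          # first history step covered by this entry
    end: int            # last history step covered (inclusive)
    key: List[float]    # stand-in for the key part of a KV block
    value: List[float]  # stand-in for the value part of a KV block

    @property
    def span(self) -> int:
        return self.end - self.start + 1


def merge(a: Entry, b: Entry) -> Entry:
    """Merge two adjacent entries by span-weighted averaging (assumed rule)."""
    total = a.span + b.span
    key = [(a.span * x + b.span * y) / total for x, y in zip(a.key, b.key)]
    value = [(a.span * x + b.span * y) / total for x, y in zip(a.value, b.value)]
    return Entry(a.start, b.end, key, value)


class FadingCache:
    """Fixed-budget cache that keeps recent entries fine and old ones coarse."""

    def __init__(self, budget: int, alpha: float = 1.0):
        self.budget = budget            # maximum number of entries kept
        self.alpha = alpha              # hypothetical power-law exponent
        self.entries: List[Entry] = []  # ordered oldest to newest
        self.t = 0                      # number of steps inserted so far

    def insert(self, key: List[float], value: List[float]) -> None:
        # New history always enters as a fine-grained, single-step entry.
        self.entries.append(Entry(self.t, self.t, list(key), list(value)))
        self.t += 1
        while len(self.entries) > self.budget:
            self._merge_one()

    def _merge_one(self) -> None:
        now = self.t

        def score(i: int) -> float:
            a, b = self.entries[i], self.entries[i + 1]
            merged_span = b.end - a.start + 1
            distance = now - b.end  # >= 1; larger means further in the past
            # Low score: the merged span is small relative to what the
            # power-law allowance distance**alpha permits at that distance.
            return merged_span / (distance ** self.alpha)

        i = min(range(len(self.entries) - 1), key=score)
        self.entries[i:i + 2] = [merge(self.entries[i], self.entries[i + 1])]


cache = FadingCache(budget=4, alpha=1.0)
for step in range(8):
    cache.insert([float(step)], [float(step)])
print([(e.start, e.end) for e in cache.entries])
# [(0, 3), (4, 5), (6, 6), (7, 7)]
```

With a budget of four entries and eight inserted steps, the two most recent steps keep their own entries while the oldest four steps share one, which is the dense-near, sparse-far layout in miniature. A larger `alpha` in this sketch makes distant history coarsen faster; the actual schedule and merge operator used by FadeMem may differ.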