Echo-Forcing: A Scene Memory Framework for Interactive Long Video Generation
May 15, 2026
Authors: Mingqiang Wu, Weilun Feng, Zhefeng Zhang, Haotong Qin, Yuqi Li, Guoxin Fan, Xiaokun Liu, Zhulin An, Libo Huang, Yongjun Xu, Chuanguang Yang
cs.AI
Abstract
Autoregressive video diffusion models enable open-ended generation through local attention and KV caching. However, existing training-free long-video optimization methods mainly target stable extension under a single prompt, and they struggle with interactive scenarios that involve prompt switching, forgetting of old scenes, and recall of historical scenes. We identify the core bottleneck as the functional entanglement of historical KV states: stable anchors and recent dynamics are handled by the same cache policy, which leads to contamination from outdated backgrounds, delayed response to new prompts, and loss of long-range memory. To address this, we propose Echo-Forcing, a training-free scene memory framework designed for interactive long video generation, with three core mechanisms: (1) Hierarchical Temporal Memory, which decouples stable anchors, compressed history, and recent windows under relative RoPE; (2) Scene Recall Frames, which compress historical scenes into spatially structured KV representations to support long-range recall; and (3) Difference-aware Memory Decay, which adaptively forgets conflicting tokens according to the discrepancy between old and new scenes. With these designs, Echo-Forcing supports smooth transitions, hard cuts, and long-range scene recall in a unified way under a bounded cache budget. Extensive evaluations on VBench-Long further show that Echo-Forcing achieves the best overall performance in both long-video generation and interactive video generation settings. Our code is available at https://github.com/mingqiangWu/Echo-Forcing
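The abstract does not give implementation details, so the following is only a toy sketch of the general idea of a bounded cache split into stable anchors, compressed history, and a recent window, with a difference-based decay applied when the scene changes. Everything in it is an assumption made for illustration: the class name `TieredCache`, its parameters, the use of plain feature vectors instead of real KV tensors, averaging as the stand-in for compression, and cosine similarity as the difference measure. It omits relative RoPE and Scene Recall Frames entirely and should not be read as the authors' method; the released repository is the reference for that.

```python
from dataclasses import dataclass, field
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


@dataclass
class TieredCache:
    """Hypothetical three-tier cache: anchors, compressed history, recent window."""
    num_anchors: int = 2   # assumed: earliest frames are kept as stable anchors
    max_history: int = 4   # assumed: cap on compressed-history slots
    window: int = 3        # assumed: size of the recent window
    anchors: list = field(default_factory=list)
    history: list = field(default_factory=list)  # entries are [vector, weight]
    recent: list = field(default_factory=list)

    def append(self, frame):
        """Add one frame feature vector while keeping the total size bounded."""
        if len(self.anchors) < self.num_anchors:
            self.anchors.append(frame)
            return
        self.recent.append(frame)
        if len(self.recent) > self.window:
            # The frame leaving the recent window moves into compressed history.
            self.history.append([self.recent.pop(0), 1.0])
        if len(self.history) > self.max_history:
            # Merge the two oldest slots by averaging (a stand-in for compression).
            (v0, w0), (v1, w1) = self.history[0], self.history[1]
            self.history[:2] = [[mean_vector([v0, v1]), max(w0, w1)]]

    def decay(self, new_scene):
        """Scale each history weight by its clamped similarity to the new scene."""
        for entry in self.history:
            similarity = cosine_similarity(entry[0], new_scene)
            # Similar entries keep their weight; conflicting ones fade toward 0.
            entry[1] *= max(0.0, min(1.0, similarity))

    def size(self):
        """Total number of cached entries across the three tiers."""
        return len(self.anchors) + len(self.history) + len(self.recent)


# Usage: stream frames from one scene, then signal a switch to a different scene.
cache = TieredCache()
for _ in range(10):
    cache.append([1.0, 0.0])   # old scene
cache.decay([0.0, 1.0])        # new scene is orthogonal to the old one
```

In this toy setup the cache never holds more than `num_anchors + max_history + window` entries, and the decay touches only the compressed-history tier, leaving anchors and the recent window untouched. Whether Echo-Forcing treats its tiers this way is not stated in the abstract.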