Echo-Forcing: A Scene Memory Framework for Interactive Long Video Generation

Abstract

Autoregressive video diffusion models enable open-ended generation through local attention and KV caching. However, existing training-free long-video optimization methods focus mainly on stable extension under a single prompt and struggle with interactive scenarios that involve prompt switching, forgetting old scenes, and recalling historical scenes. We identify the core bottleneck as the functional entanglement of historical KV states: stable anchors and recent dynamics are handled by the same cache policy, which leads to contamination from outdated backgrounds, delayed response to new prompts, and loss of long-range memory. To address this issue, we propose Echo-Forcing, a training-free scene memory framework designed specifically for interactive long video generation, with three core mechanisms: (1) Hierarchical Temporal Memory, which decouples stable anchors, compressed history, and recent windows under relative RoPE; (2) Scene Recall Frames, which compress historical scenes into spatially structured KV representations to support long-term recall; and (3) Difference-aware Memory Decay, which adaptively forgets conflicting tokens according to the discrepancy between old and new scenes. Building on these designs, Echo-Forcing supports smooth transitions, hard cuts, and long-range scene recall in a unified manner under a bounded cache budget. Extensive evaluations on VBench-Long demonstrate that Echo-Forcing achieves the best overall performance in both long-video generation and interactive video generation settings. Our code is available at https://github.com/mingqiangWu/Echo-Forcing
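To make the idea of a tiered, bounded cache concrete, the following is a minimal toy sketch in Python. It is not the Echo-Forcing implementation: the class name `TieredCache`, its parameters, the mean-pooling compression, and the threshold-based forgetting rule are all assumptions chosen for illustration, and each "frame" is a plain list of floats standing in for a frame's KV state. Relative RoPE and the spatial structure of Scene Recall Frames are not modeled.

```python
from collections import deque
from dataclasses import dataclass, field


def mean_pool(frames):
    """Average a list of equal-length feature vectors into one vector."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


@dataclass
class TieredCache:
    """Hypothetical three-tier cache: anchors, compressed history, recent window."""
    num_anchor: int = 1    # earliest frames, kept verbatim as stable anchors
    num_history: int = 2   # slots available for compressed history
    num_recent: int = 3    # sliding window over the latest frames
    anchors: list = field(default_factory=list)
    history: list = field(default_factory=list)
    recent: deque = field(default_factory=deque)

    def append(self, frame):
        # Fill the anchor tier first; anchors are never evicted.
        if len(self.anchors) < self.num_anchor:
            self.anchors.append(frame)
            return
        self.recent.append(frame)
        if len(self.recent) > self.num_recent:
            # The oldest recent frame moves into the history tier.
            self.history.append(self.recent.popleft())
            if len(self.history) > self.num_history:
                # Assumed compression: merge the two oldest history entries.
                merged = mean_pool(self.history[:2])
                self.history = [merged] + self.history[2:]

    def context(self):
        """Frames visible to attention; size never exceeds the total budget."""
        return self.anchors + self.history + list(self.recent)

    def forget_conflicting(self, new_frame, threshold):
        """Assumed decay rule: drop history entries far from the new scene."""
        def discrepancy(old):
            return sum(abs(a - b) for a, b in zip(old, new_frame)) / len(new_frame)

        kept = [h for h in self.history if discrepancy(h) <= threshold]
        dropped = len(self.history) - len(kept)
        self.history = kept
        return dropped


cache = TieredCache()
for i in range(10):
    cache.append([float(i)])
print(len(cache.context()))  # 6 = 1 anchor + 2 history + 3 recent
```

The point of the sketch is only that the three tiers follow different policies (never evict, compress, slide), so the context stays within a fixed budget no matter how many frames are generated, and that forgetting can depend on how much stored history disagrees with the incoming scene.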