MemFlow: Flowing Adaptive Memory for Consistent and Efficient Long Video Narratives

Abstract

The core challenge in streaming video generation is maintaining content consistency over a long context, which places high demands on memory design. Most existing solutions maintain memory by compressing historical frames with predefined strategies. However, different video chunks to be generated need to refer to different historical cues, which fixed strategies struggle to provide. In this work, we propose MemFlow to address this problem. Specifically, before generating the next chunk, we dynamically update the memory bank by retrieving the historical frames most relevant to that chunk's text prompt. This design preserves narrative coherence even when a new event occurs or the scene switches in later frames. In addition, during generation, we activate only the most relevant tokens in the memory bank for each query in the attention layers, which keeps generation efficient. In this way, MemFlow achieves outstanding long-context consistency with negligible computational overhead (a 7.9% speed reduction compared with the memory-free baseline) and remains compatible with any streaming video generation model that uses a KV cache.
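
The two mechanisms described above (prompt-guided retrieval into the memory bank, and per-query activation of only the most relevant memory tokens) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `update_memory_bank` and `sparse_attention`, the use of cosine similarity between a prompt embedding and per-frame embeddings, and the top-k dot-product selection are all assumptions made here for illustration, since the abstract does not specify how relevance is scored.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def update_memory_bank(prompt_emb, frame_embs, bank_size):
    """Pick the bank_size historical frames most similar to the prompt.

    Assumption: relevance is cosine similarity between embeddings.
    Returns frame indices in temporal order.
    """
    ranked = sorted(
        range(len(frame_embs)),
        key=lambda i: cosine(prompt_emb, frame_embs[i]),
        reverse=True,
    )
    return sorted(ranked[:bank_size])


def sparse_attention(query, keys, values, top_k):
    """Attend only to the top_k memory tokens with the highest scores.

    Assumption: scores are scaled dot products; all other tokens are skipped.
    """
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    active = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:top_k]
    peak = max(scores[i] for i in active)  # subtract the max for a stable softmax
    weights = [math.exp(scores[i] - peak) for i in active]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * values[i][d] for w, i in zip(weights, active)) / total
        for d in range(dim)
    ]


# Toy usage: four past frames with 2-D embeddings, a memory bank of two frames.
frame_embs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
bank = update_memory_bank([1.0, 0.0], frame_embs, bank_size=2)
memory_keys = [frame_embs[i] for i in bank]
output = sparse_attention([1.0, 0.0], memory_keys, memory_keys, top_k=1)
```

In this toy setup the attention cost per query depends on `top_k` rather than on the full memory size, which is the general reason such sparse activation can keep overhead small. The 7.9% figure reported above comes from the paper, not from this sketch.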
