MemFlow: Flowing Adaptive Memory for Consistent and Efficient Long Video Narratives

Abstract

The core challenge in streaming video generation is maintaining content consistency over long contexts, which places high demands on memory design. Most existing solutions maintain memory by compressing historical frames with predefined strategies. However, different video chunks to be generated need to refer to different historical cues, which fixed strategies struggle to satisfy. In this work, we propose MemFlow to address this problem. Specifically, before generating the next chunk, we dynamically update the memory bank by retrieving the historical frames most relevant to that chunk's text prompt. This design preserves narrative coherence even when new events occur or the scene switches in future frames. In addition, during generation we activate only the most relevant tokens in the memory bank for each query in the attention layers, which keeps generation efficient. In this way, MemFlow achieves outstanding long-context consistency with negligible computational overhead (a 7.9% speed reduction compared with the memory-free baseline) and remains compatible with any streaming video generation model that uses a KV cache.
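The two mechanisms described above, prompt-conditioned retrieval into a memory bank and per-query activation of only the most relevant memory tokens, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (update_memory_bank, sparse_attention), the use of cosine similarity between hypothetical prompt and frame embeddings as the relevance score, and the plain top-k selection are all assumptions made for illustration only.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def update_memory_bank(prompt_emb, frame_embs, bank_size):
    """Pick the bank_size historical frames most similar to the prompt.

    Assumption: relevance is cosine similarity between hypothetical
    embeddings; the paper's actual scoring may differ.
    """
    ranked = sorted(
        range(len(frame_embs)),
        key=lambda i: cosine(prompt_emb, frame_embs[i]),
        reverse=True,
    )
    # Return the selected frame indices in their original temporal order.
    return sorted(ranked[:bank_size])


def sparse_attention(query, keys, values, top_k):
    """Attend only to the top_k memory tokens with the highest scores.

    Assumes top_k >= 1 and at least one key (hypothetical helper).
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    active = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:top_k]
    # Softmax over the active tokens only; subtract the peak for stability.
    peak = max(scores[i] for i in active)
    weights = [math.exp(scores[i] - peak) for i in active]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * values[i][d] for w, i in zip(weights, active)) / total
        for d in range(dim)
    ]
```

In this sketch the memory bank is rebuilt before each chunk from the current prompt, so a scene switch simply changes which historical frames are retrieved, while the attention cost per query depends on top_k rather than on the full memory size.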
