
PermaVid: Consistent Video Generation Across Edits via Disentangled Context Memory

June 15, 2026
Authors: Shuai Yang, Bingjie Gao, Ziwei Liu, Jiaqi Wang, Dahua Lin, Tong Wu
cs.AI

Abstract

Consistent video generation under editing operations requires persistence: when edits modify scene appearance or layout, subsequent generations should remain coherent across time and viewpoints. However, existing memory designs struggle to maintain long-term consistency after such modifications, as stored contexts may become outdated or invalid. To address this, we propose PermaVid, a novel framework built upon a multi-modal context memory that disentangles spatial context into semantic appearance and geometric structure, together with an edit-aware memory update and retrieval strategy that keeps memory evolution aligned with subsequent observations. Specifically, we develop two complementary memory banks: an RGB context memory that captures appearance-aware observations while implicitly encoding geometry, and a depth context memory that preserves geometry-only structure disentangled from semantics. Building on this design, we introduce a memory-guided video generation model that performs multi-modal feature fusion under reference conditions drawn from mixed-modality memory contexts. Experiments demonstrate that our method maintains strong long-term semantic and structural consistency after edits, significantly outperforming state-of-the-art methods.
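
The abstract gives no implementation details, so the following is only a toy sketch of how two complementary memory banks with an edit-aware update could behave. Everything in it is an assumption made for illustration: the names `Entry`, `DualContextMemory`, `apply_edit`, and `retrieve`, the use of camera positions as retrieval keys, the distance-based invalidation rule, and the string payloads standing in for RGB and depth features. It is not the authors' method. The one idea it tries to capture is that an appearance-only edit makes stored RGB context stale while geometry-only context can stay valid, whereas a layout edit invalidates both.

```python
from dataclasses import dataclass, field
from math import dist


@dataclass
class Entry:
    step: int        # generation step at which the observation was stored
    position: tuple  # hypothetical camera position (x, y, z) used as the key
    payload: str     # stand-in for an RGB or depth feature


@dataclass
class DualContextMemory:
    """Toy memory with separate RGB and depth banks (illustrative only)."""
    rgb: list = field(default_factory=list)
    depth: list = field(default_factory=list)

    def write(self, step, position, rgb_payload, depth_payload):
        # Store one observation in both banks under the same key.
        self.rgb.append(Entry(step, position, rgb_payload))
        self.depth.append(Entry(step, position, depth_payload))

    def apply_edit(self, center, radius, changes_layout):
        # Assumed rule: entries within `radius` of the edit are stale.
        # Appearance edits drop only RGB entries; layout edits drop both.
        def fresh(entry):
            return dist(entry.position, center) > radius

        self.rgb = [e for e in self.rgb if fresh(e)]
        if changes_layout:
            self.depth = [e for e in self.depth if fresh(e)]

    def retrieve(self, position, k=1):
        # Return the k nearest surviving entries from each bank.
        def nearest(bank):
            return sorted(bank, key=lambda e: dist(e.position, position))[:k]

        return nearest(self.rgb), nearest(self.depth)


memory = DualContextMemory()
memory.write(0, (0.0, 0.0, 0.0), "rgb0", "depth0")
memory.write(1, (5.0, 0.0, 0.0), "rgb1", "depth1")

# An appearance-only edit near the origin invalidates the first RGB entry
# but leaves the depth bank untouched.
memory.apply_edit(center=(0.0, 0.0, 0.0), radius=1.0, changes_layout=False)
rgb_refs, depth_refs = memory.retrieve((0.5, 0.0, 0.0))
```

In this toy run, the depth bank still returns the nearby pre-edit observation while the RGB bank falls back to the more distant one. That shows why keeping geometry separate from appearance could let structural context survive an appearance edit. A real system would retrieve learned features and fuse them inside the generator rather than returning raw entries.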