PermaVid: Consistent Video Generation Across Edits via Disentangled Context Memory

Abstract

Consistent video generation under editing operations requires persistence: when an edit modifies scene appearance or layout, subsequent generations must remain coherent across time and viewpoints. Existing memory designs struggle to maintain long-term consistency after such modifications because stored contexts may become outdated or invalid. To address this, we propose PermaVid, a novel framework built on a multi-modal context memory that disentangles spatial context into semantic appearance and geometric structure, combined with an edit-aware memory update and retrieval strategy that keeps the memory's evolution aligned with subsequent observations. Specifically, we build two complementary memory banks: an RGB context memory that captures appearance-aware observations while implicitly encoding geometry, and a depth context memory that preserves geometry-only structure disentangled from semantics. Building on this design, we introduce a memory-guided video generation model that draws reference conditions from mixed-modality memory contexts and performs multi-modal feature fusion over them. Experiments show that our method maintains strong long-term semantic and structural consistency after edits and significantly outperforms state-of-the-art methods.
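
To make the dual-memory idea concrete, the following is a minimal sketch of two modality-specific banks with an edit-aware update and a pose-based retrieval. It is not the PermaVid implementation: the class names (`ContextBank`, `DualContextMemory`), the use of a 3D camera position as the retrieval key, the radius-based invalidation, and the rule that an appearance-only edit invalidates RGB entries while leaving depth entries intact are all illustrative assumptions. String payloads stand in for real RGB and depth features.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class MemoryEntry:
    frame_id: int
    position: Vec3   # hypothetical retrieval key: camera position of the frame
    payload: str     # stand-in for an RGB or depth feature


class ContextBank:
    """A single-modality memory bank (hypothetical design)."""

    def __init__(self) -> None:
        self.entries: List[MemoryEntry] = []

    def write(self, entry: MemoryEntry) -> None:
        self.entries.append(entry)

    def invalidate(self, center: Vec3, radius: float) -> int:
        """Drop entries within `radius` of an edit; return how many were dropped."""
        kept = [e for e in self.entries if math.dist(e.position, center) > radius]
        dropped = len(self.entries) - len(kept)
        self.entries = kept
        return dropped

    def retrieve(self, query: Vec3, k: int) -> List[MemoryEntry]:
        """Return the k entries nearest to `query`, preferring newer frames on ties."""
        ranked = sorted(
            self.entries,
            key=lambda e: (math.dist(e.position, query), -e.frame_id),
        )
        return ranked[:k]


class DualContextMemory:
    """RGB and depth banks updated together but invalidated separately."""

    def __init__(self) -> None:
        self.rgb = ContextBank()
        self.depth = ContextBank()

    def write(self, frame_id: int, position: Vec3, rgb: str, depth: str) -> None:
        self.rgb.write(MemoryEntry(frame_id, position, rgb))
        self.depth.write(MemoryEntry(frame_id, position, depth))

    def apply_edit(self, center: Vec3, radius: float, changes_layout: bool) -> None:
        # Assumption: every edit makes nearby appearance stale, so RGB entries
        # are always dropped; depth entries are dropped only if geometry changed.
        self.rgb.invalidate(center, radius)
        if changes_layout:
            self.depth.invalidate(center, radius)

    def retrieve(self, query: Vec3, k: int = 2):
        """Return (rgb_refs, depth_refs) to serve as mixed-modality conditions."""
        return self.rgb.retrieve(query, k), self.depth.retrieve(query, k)


if __name__ == "__main__":
    memory = DualContextMemory()
    memory.write(0, (0.0, 0.0, 0.0), "rgb_0", "depth_0")
    memory.write(1, (5.0, 0.0, 0.0), "rgb_1", "depth_1")
    # An appearance-only edit near the origin (e.g. a recolored object).
    memory.apply_edit((0.0, 0.0, 0.0), radius=1.0, changes_layout=False)
    rgb_refs, depth_refs = memory.retrieve((0.0, 0.0, 0.0), k=1)
    print([e.payload for e in rgb_refs], [e.payload for e in depth_refs])
```

Under these assumptions, keeping the banks separate lets geometry recorded before an appearance-only edit remain available as a reference even after the corresponding RGB entry has been discarded as stale.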