PermaVid: Consistent Video Generation Across Edits via Disentangled Context Memory

Abstract

Consistent video generation under editing operations requires persistence: when an edit modifies scene appearance or layout, subsequent generations should remain coherent across time and viewpoints. However, existing memory designs struggle to maintain long-term consistency after such modifications, because stored contexts may become outdated or invalid. To address this, we propose PermaVid, a novel framework built on a multi-modal context memory that disentangles spatial context into semantic appearance and geometric structure, combined with an edit-aware memory update and retrieval strategy that keeps memory evolution aligned with subsequent observations. Specifically, we develop two complementary memory banks: an RGB context memory that captures appearance-aware observations while implicitly encoding geometry, and a depth context memory that preserves geometry-only structure disentangled from semantics. Building on this design, we introduce a memory-guided video generation model that performs multi-modal feature fusion conditioned on references drawn from mixed-modality memory contexts. Experiments demonstrate that our method maintains strong long-term semantic and structural consistency after edits, significantly outperforming existing state-of-the-art methods.
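
To make the two-bank idea concrete, the following is a minimal sketch of how an edit-aware memory with separate RGB and depth validity might behave. It is not the PermaVid implementation: the class names (`MemoryEntry`, `DualContextMemory`), the position-based nearest-neighbor retrieval, the radius-based invalidation rule, and the list-valued features are all assumptions introduced purely for illustration. The one point it tries to show is that an appearance-only edit can invalidate stored RGB context while the geometry-only depth context stays usable.

```python
import math
from dataclasses import dataclass


@dataclass
class MemoryEntry:
    """One stored observation; every field is an illustrative placeholder."""
    frame_id: int
    position: tuple           # hypothetical camera position (x, y, z)
    rgb_feature: list         # stand-in for an appearance-aware feature
    depth_feature: list       # stand-in for a geometry-only feature
    rgb_valid: bool = True    # False once an edit makes the appearance stale
    depth_valid: bool = True  # False once an edit makes the geometry stale


class DualContextMemory:
    """Hypothetical memory with separate validity for RGB and depth context."""

    def __init__(self):
        self.entries = []

    def write(self, frame_id, position, rgb_feature, depth_feature):
        """Store one observation in both banks."""
        self.entries.append(
            MemoryEntry(frame_id, tuple(position),
                        list(rgb_feature), list(depth_feature))
        )

    def apply_edit(self, center, radius, changes_geometry):
        """Assumed update rule: mark entries observed near the edit as stale.

        Any edit invalidates appearance; only layout-changing edits
        also invalidate geometry.
        """
        for entry in self.entries:
            if math.dist(entry.position, center) <= radius:
                entry.rgb_valid = False
                if changes_geometry:
                    entry.depth_valid = False

    def retrieve(self, query_position, k=2):
        """Return the k nearest still-valid entries from each bank."""
        def nearest(valid_flag):
            valid = [e for e in self.entries if getattr(e, valid_flag)]
            valid.sort(key=lambda e: math.dist(e.position, query_position))
            return valid[:k]

        return nearest("rgb_valid"), nearest("depth_valid")


# Usage: an appearance-only edit (e.g. a recolor) near the origin.
memory = DualContextMemory()
memory.write(0, (0.0, 0.0, 0.0), [1.0], [0.5])
memory.write(1, (5.0, 0.0, 0.0), [2.0], [0.7])
memory.apply_edit(center=(0.0, 0.0, 0.0), radius=1.0, changes_geometry=False)

rgb_refs, depth_refs = memory.retrieve((0.0, 0.0, 0.0))
print([e.frame_id for e in rgb_refs])    # [1]
print([e.frame_id for e in depth_refs])  # [0, 1]
```

In this toy setup the generator would receive a mixed-modality reference set: the depth entry from frame 0 still supplies structure for the edited region, while appearance has to come from unaffected frames or from post-edit observations. How the actual method detects edits, scores candidates, and fuses the two modalities is not specified in the abstract.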