PermaVid: Consistent Video Generation Across Edits via Disentangled Context Memory

Abstract

Consistent video generation under editing operations requires persistence: when edits modify scene appearance or layout, subsequent generations should remain coherent across time and viewpoints. However, existing memory designs struggle to maintain long-term consistency after such modifications, because stored contexts may become outdated or invalid. To address this problem, we propose PermaVid, a novel framework built upon a multi-modal context memory that disentangles spatial context into semantic appearance and geometric structure, together with an edit-aware memory update and retrieval strategy that keeps memory evolution aligned with subsequent observations. Specifically, we develop two complementary memory banks: an RGB context memory that captures appearance-aware observations while implicitly encoding geometry, and a depth context memory that preserves geometry-only structure disentangled from semantics. Building on this design, we introduce a memory-guided video generation model that performs multi-modal feature fusion under reference conditions drawn from mixed-modality memory contexts. Experiments demonstrate that our method maintains strong long-term semantic and structural consistency after edits, significantly outperforming state-of-the-art methods.
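
The abstract does not specify how the two memory banks are indexed or how the edit-aware update is triggered, so the following is only an illustrative sketch of the general idea, not the PermaVid implementation. It assumes that each entry is keyed by a hypothetical camera position, that an edit affects a spherical region, and that an appearance-only edit invalidates RGB entries while leaving depth entries intact. The names `Entry`, `DualContextMemory`, `apply_edit`, and `retrieve` are all invented for this example.

```python
from dataclasses import dataclass, field
from math import dist


@dataclass
class Entry:
    position: tuple  # assumed key: camera position (x, y, z) of the stored view
    payload: str     # stand-in for a stored RGB or depth feature
    step: int        # generation step at which the entry was written


@dataclass
class DualContextMemory:
    """Toy pair of memory banks: one for appearance (RGB), one for geometry (depth)."""
    rgb: list = field(default_factory=list)
    depth: list = field(default_factory=list)

    def write(self, position, rgb_payload, depth_payload, step):
        # Each observation contributes one entry to each bank.
        self.rgb.append(Entry(position, rgb_payload, step))
        self.depth.append(Entry(position, depth_payload, step))

    def apply_edit(self, center, radius, changes_layout):
        # Assumed rule: entries inside the edited sphere are stale.
        # Appearance edits drop RGB entries only; layout edits drop both.
        def outside(entry):
            return dist(entry.position, center) > radius

        self.rgb = [e for e in self.rgb if outside(e)]
        if changes_layout:
            self.depth = [e for e in self.depth if outside(e)]

    def retrieve(self, query_position, k=2):
        # Return the k nearest surviving entries from each bank.
        def nearest(bank):
            return sorted(bank, key=lambda e: dist(e.position, query_position))[:k]

        return {"rgb": nearest(self.rgb), "depth": nearest(self.depth)}


memory = DualContextMemory()
memory.write((0, 0, 0), "rgb_a", "depth_a", step=0)
memory.write((1, 0, 0), "rgb_b", "depth_b", step=1)
memory.write((5, 0, 0), "rgb_c", "depth_c", step=2)

# An appearance-only edit near the origin invalidates nearby RGB entries.
memory.apply_edit(center=(0, 0, 0), radius=1.5, changes_layout=False)
context = memory.retrieve((0.5, 0, 0), k=2)
```

In this toy run, the retrieved context mixes modalities: geometry still comes from the two views nearest the query, while appearance falls back to the one RGB entry that the edit did not invalidate. A real system would store learned features and would likely re-write edited observations rather than simply dropping them.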