VidEdit: Zero-Shot and Spatially Aware Text-Driven Video Editing

Abstract

Recently, diffusion-based generative models have achieved remarkable success in image generation and editing. However, their use for video editing still faces important limitations. This paper introduces VidEdit, a novel method for zero-shot text-based video editing that ensures strong temporal and spatial consistency. First, we combine atlas-based and pre-trained text-to-image diffusion models to provide a training-free and efficient editing method that is temporally smooth by design. Second, we leverage off-the-shelf panoptic segmenters along with edge detectors and adapt their use for conditioned diffusion-based atlas editing. This ensures fine spatial control over targeted regions while strictly preserving the structure of the original video. Quantitative and qualitative experiments show that VidEdit outperforms state-of-the-art methods on the DAVIS dataset in terms of semantic faithfulness, image preservation, and temporal consistency metrics. With this framework, processing a single video takes only approximately one minute, and multiple compatible edits can be generated from a single text prompt. Project web page: https://videdit.github.io
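
To illustrate why editing a shared atlas can give temporal smoothness by design, here is a minimal NumPy sketch. It is not the authors' implementation: the function names, the integer per-frame coordinate maps `uv_maps`, the nearest-neighbour lookup, and the stand-in `edit_fn` (which replaces the conditioned diffusion model) are all assumptions made for illustration. The binary `mask` plays the role that a segmenter output might play in restricting the edit to a targeted region.

```python
import numpy as np

def edit_atlas(atlas, mask, edit_fn):
    """Apply edit_fn to the atlas and keep the result only inside mask.

    atlas: (A_h, A_w, C) float array, mask: (A_h, A_w) bool array.
    edit_fn is a hypothetical stand-in for a conditioned diffusion edit.
    """
    edited = edit_fn(atlas)
    return np.where(mask[..., None], edited, atlas)

def render_frames(atlas, uv_maps):
    """Rebuild every frame by nearest-neighbour lookup into the shared atlas.

    uv_maps: (T, H, W, 2) integer (row, col) atlas coordinates per frame pixel
    (an assumed, simplified form of an atlas mapping).
    """
    rows = uv_maps[..., 0]
    cols = uv_maps[..., 1]
    return atlas[rows, cols]  # shape (T, H, W, C)

# Toy data: a 4x4 atlas and two 2x2 frames, the second shifted by one column.
atlas = np.zeros((4, 4, 3))
mask = np.zeros((4, 4), dtype=bool)
mask[:2, :2] = True  # targeted region (assumed to come from a segmenter)

rows, cols = np.meshgrid(np.arange(2), np.arange(2), indexing="ij")
uv_maps = np.stack([
    np.stack([rows, cols], axis=-1),      # frame 0 views atlas columns 0..1
    np.stack([rows, cols + 1], axis=-1),  # frame 1 views atlas columns 1..2
])

edited_atlas = edit_atlas(atlas, mask, lambda a: a + 1.0)
frames = render_frames(edited_atlas, uv_maps)
```

Because both frames read from the same edited atlas, a given atlas point has the same appearance in every frame where it is visible, and pixels that map outside the mask are left unchanged.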
