VidEdit: Zero-Shot and Spatially Aware Text-Driven Video Editing

June 14, 2023
Authors: Paul Couairon, Clément Rambour, Jean-Emmanuel Haugeard, Nicolas Thome
cs.AI

Abstract

Recently, diffusion-based generative models have achieved remarkable success in image generation and editing. However, their use for video editing still faces important limitations. This paper introduces VidEdit, a novel method for zero-shot text-based video editing that ensures strong temporal and spatial consistency. First, we propose to combine an atlas-based video representation with pre-trained text-to-image diffusion models, yielding a training-free and efficient editing method that fulfills temporal smoothness by design. Second, we leverage off-the-shelf panoptic segmenters along with edge detectors and adapt their use for conditioned diffusion-based atlas editing. This ensures fine-grained spatial control over targeted regions while strictly preserving the structure of the original video. Quantitative and qualitative experiments show that VidEdit outperforms state-of-the-art methods on the DAVIS dataset in terms of semantic faithfulness, image preservation, and temporal consistency metrics. With this framework, processing a single video takes only about one minute, and multiple compatible edits can be generated from a single text prompt. Project webpage: https://videdit.github.io
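
As a rough illustration of the pipeline the abstract describes (not the authors' implementation), the minimal sketch below edits a single precomputed atlas image: an off-the-shelf panoptic segmenter localizes the target object, a Canny edge map constrains the structure, and an edge-conditioned inpainting diffusion model rewrites only the masked region. The atlas file `atlas.png`, the target label, the prompt, the model checkpoints, the GPU requirement, and the `map_atlas_to_frames` helper for projecting the edited atlas back onto frames are all illustrative assumptions.

```python
# Hypothetical sketch of an atlas-level, text-driven edit in the spirit of VidEdit.
# Assumes a precomputed layered-atlas image of the video; NOT the authors' code.
import numpy as np
import torch
import cv2
from PIL import Image
from transformers import AutoImageProcessor, Mask2FormerForUniversalSegmentation
from diffusers import ControlNetModel, StableDiffusionControlNetInpaintPipeline

atlas = Image.open("atlas.png").convert("RGB")   # precomputed video atlas (assumed input)
prompt = "a car made of lego bricks"             # example edit prompt
target_label = "car"                             # object class to edit

# 1) Localize the target region with an off-the-shelf panoptic segmenter.
seg_name = "facebook/mask2former-swin-large-coco-panoptic"
processor = AutoImageProcessor.from_pretrained(seg_name)
segmenter = Mask2FormerForUniversalSegmentation.from_pretrained(seg_name)
inputs = processor(images=atlas, return_tensors="pt")
with torch.no_grad():
    outputs = segmenter(**inputs)
panoptic = processor.post_process_panoptic_segmentation(
    outputs, target_sizes=[atlas.size[::-1]])[0]
mask = np.zeros(atlas.size[::-1], dtype=np.uint8)
for info in panoptic["segments_info"]:
    if segmenter.config.id2label[info["label_id"]] == target_label:
        mask[(panoptic["segmentation"] == info["id"]).numpy()] = 255
mask_image = Image.fromarray(mask)

# 2) Extract edges so the edit preserves the original structure.
gray = cv2.cvtColor(np.array(atlas), cv2.COLOR_RGB2GRAY)
edges = cv2.Canny(gray, 100, 200)
control_image = Image.fromarray(np.stack([edges] * 3, axis=-1))

# 3) Edit only the masked atlas region with an edge-conditioned inpainting model.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11p_sd15_canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",
    controlnet=controlnet, torch_dtype=torch.float16).to("cuda")
edited_atlas = pipe(prompt, image=atlas, mask_image=mask_image,
                    control_image=control_image, num_inference_steps=30).images[0]

# 4) Project the edited atlas back onto every frame via the atlas UV mapping
#    (hypothetical helper, provided by the layered-atlas decomposition).
# frames = map_atlas_to_frames(edited_atlas, uv_mapping)
edited_atlas.save("edited_atlas.png")
```

Because the edit is performed once on the shared atlas rather than frame by frame, temporal consistency follows from the atlas-to-frame mapping rather than from any per-frame synchronization.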