VidEdit: Zero-Shot and Spatially Aware Text-Driven Video Editing

June 14, 2023
Authors: Paul Couairon, Clément Rambour, Jean-Emmanuel Haugeard, Nicolas Thome
cs.AI

Abstract

Recently, diffusion-based generative models have achieved remarkable success in image generation and editing. However, their use for video editing still faces important limitations. This paper introduces VidEdit, a novel method for zero-shot text-based video editing that ensures strong temporal and spatial consistency. First, we propose to combine an atlas-based video representation with pre-trained text-to-image diffusion models, yielding a training-free and efficient editing method that fulfills temporal smoothness by design. Second, we leverage off-the-shelf panoptic segmenters along with edge detectors and adapt their use for conditioned diffusion-based atlas editing. This ensures fine-grained spatial control over targeted regions while strictly preserving the structure of the original video. Quantitative and qualitative experiments show that VidEdit outperforms state-of-the-art methods on the DAVIS dataset in terms of semantic faithfulness, image preservation, and temporal consistency metrics. With this framework, processing a single video takes only about one minute, and multiple compatible edits can be generated from a single text prompt. Project webpage: https://videdit.github.io
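
As a rough illustration of the pipeline the abstract describes (not the authors' implementation), the minimal sketch below edits a single precomputed atlas image: an off-the-shelf panoptic segmenter localizes the target object, a Canny edge map constrains the structure, and an edge-conditioned inpainting diffusion model rewrites only the masked region. The atlas file `atlas.png`, the target label, the prompt, the model checkpoints, the GPU requirement, and the `map_atlas_to_frames` helper for projecting the edited atlas back onto frames are all illustrative assumptions.

```python
# Hypothetical sketch of an atlas-level, text-driven edit in the spirit of VidEdit.
# Assumes a precomputed layered-atlas image of the video; NOT the authors' code.
import numpy as np
import torch
import cv2
from PIL import Image
from transformers import AutoImageProcessor, Mask2FormerForUniversalSegmentation
from diffusers import ControlNetModel, StableDiffusionControlNetInpaintPipeline

atlas = Image.open("atlas.png").convert("RGB")   # precomputed video atlas (assumed input)
prompt = "a car made of lego bricks"             # example edit prompt
target_label = "car"                             # object class to edit

# 1) Localize the target region with an off-the-shelf panoptic segmenter.
seg_name = "facebook/mask2former-swin-large-coco-panoptic"
processor = AutoImageProcessor.from_pretrained(seg_name)
segmenter = Mask2FormerForUniversalSegmentation.from_pretrained(seg_name)
inputs = processor(images=atlas, return_tensors="pt")
with torch.no_grad():
    outputs = segmenter(**inputs)
panoptic = processor.post_process_panoptic_segmentation(
    outputs, target_sizes=[atlas.size[::-1]])[0]
mask = np.zeros(atlas.size[::-1], dtype=np.uint8)
for info in panoptic["segments_info"]:
    if segmenter.config.id2label[info["label_id"]] == target_label:
        mask[(panoptic["segmentation"] == info["id"]).numpy()] = 255
mask_image = Image.fromarray(mask)

# 2) Extract edges so the edit preserves the original structure.
gray = cv2.cvtColor(np.array(atlas), cv2.COLOR_RGB2GRAY)
edges = cv2.Canny(gray, 100, 200)
control_image = Image.fromarray(np.stack([edges] * 3, axis=-1))

# 3) Edit only the masked atlas region with an edge-conditioned inpainting model.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11p_sd15_canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",
    controlnet=controlnet, torch_dtype=torch.float16).to("cuda")
edited_atlas = pipe(prompt, image=atlas, mask_image=mask_image,
                    control_image=control_image, num_inference_steps=30).images[0]

# 4) Project the edited atlas back onto every frame via the atlas UV mapping
#    (hypothetical helper, provided by the layered-atlas decomposition).
# frames = map_atlas_to_frames(edited_atlas, uv_mapping)
edited_atlas.save("edited_atlas.png")
```

Because the edit is performed once on the shared atlas rather than frame by frame, temporal consistency follows from the atlas-to-frame mapping rather than from any per-frame synchronization.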