VidEdit: Zero-Shot and Spatially Aware Text-Driven Video Editing
June 14, 2023
Authors: Paul Couairon, Clément Rambour, Jean-Emmanuel Haugeard, Nicolas Thome
cs.AI
Abstract
Recently, diffusion-based generative models have achieved remarkable success in image generation and editing. However, their use for video editing still faces important limitations. This paper introduces VidEdit, a novel method for zero-shot text-based video editing that ensures strong temporal and spatial consistency. First, we propose to combine atlas-based video representations with pre-trained text-to-image diffusion models, yielding a training-free and efficient editing method that satisfies temporal smoothness by design. Second, we leverage off-the-shelf panoptic segmenters along with edge detectors and repurpose them for conditioned, diffusion-based atlas editing. This ensures fine spatial control over targeted regions while strictly preserving the structure of the original video. Quantitative and qualitative experiments show that VidEdit outperforms state-of-the-art methods on the DAVIS dataset in terms of semantic faithfulness, image preservation, and temporal consistency metrics. With this framework, processing a single video takes only about one minute, and multiple compatible edits can be generated from a single text prompt.
Project webpage: https://videdit.github.io
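As a rough illustration of the pipeline described in the abstract (edit a precomputed object atlas once under edge and segmentation-mask conditioning, then propagate the result to every frame through the atlas mapping), here is a minimal Python sketch built on off-the-shelf Hugging Face components. The checkpoints, the `atlas`/`target_mask`/`uv_grids` inputs, and the `warp_atlas_to_frame` helper are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of conditioned atlas editing: the object atlas is edited
# once with an edge-conditioned inpainting diffusion model, restricted to a
# panoptic mask of the target region; temporal consistency then follows
# because every frame samples the same edited atlas.
import torch
from PIL import Image
from controlnet_aux import HEDdetector
from diffusers import ControlNetModel, StableDiffusionControlNetInpaintPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

# Off-the-shelf edge detector and edge-conditioned ControlNet (assumed checkpoints).
hed = HEDdetector.from_pretrained("lllyasviel/Annotators")
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-hed")
pipe = StableDiffusionControlNetInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",
    controlnet=controlnet,
).to(device)


def edit_atlas(atlas: Image.Image, target_mask: Image.Image, prompt: str) -> Image.Image:
    """Edit the object atlas once, inside the target mask, guided by its edge map."""
    edges = hed(atlas)  # structure guidance extracted from the atlas itself
    return pipe(
        prompt=prompt,
        image=atlas,             # atlas image to edit
        mask_image=target_mask,  # panoptic mask of the targeted region
        control_image=edges,     # edge map preserves the original layout
        num_inference_steps=30,
    ).images[0]


# Usage sketch: `atlas`, `mask`, and per-frame UV grids come from a layered
# neural atlas decomposition of the input video (precomputed beforehand).
# edited_atlas = edit_atlas(atlas, mask, "a swan made of origami paper")
# frames = [warp_atlas_to_frame(edited_atlas, uv) for uv in uv_grids]  # hypothetical helper
```

Because the diffusion model touches only the atlas, several edits of the same video can be produced by rerunning `edit_atlas` with different prompts while the (expensive) atlas decomposition is reused.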