VidEdit: Zero-Shot and Spatially Aware Text-Driven Video Editing
June 14, 2023
Authors: Paul Couairon, Clément Rambour, Jean-Emmanuel Haugeard, Nicolas Thome
cs.AI
Abstract
Recently, diffusion-based generative models have achieved remarkable success in image generation and editing. However, their use for video editing still faces important limitations. This paper introduces VidEdit, a novel method for zero-shot text-based video editing that ensures strong temporal and spatial consistency. First, we propose to combine atlas-based video representations with pre-trained text-to-image diffusion models, yielding a training-free and efficient editing method that satisfies temporal smoothness by design. Second, we leverage off-the-shelf panoptic segmenters along with edge detectors and repurpose them for conditioned, diffusion-based atlas editing. This ensures fine spatial control over targeted regions while strictly preserving the structure of the original video. Quantitative and qualitative experiments show that VidEdit outperforms state-of-the-art methods on the DAVIS dataset in terms of semantic faithfulness, image preservation, and temporal consistency metrics. With this framework, processing a single video takes only about one minute, and multiple compatible edits can be generated from a single text prompt.
Project webpage: https://videdit.github.io
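As a rough illustration of the pipeline described in the abstract (edit a precomputed object atlas once under edge and segmentation-mask conditioning, then propagate the result to every frame through the atlas mapping), here is a minimal Python sketch built on off-the-shelf Hugging Face components. The checkpoints, the `atlas`/`target_mask`/`uv_grids` inputs, and the `warp_atlas_to_frame` helper are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of conditioned atlas editing: the object atlas is edited
# once with an edge-conditioned inpainting diffusion model, restricted to a
# panoptic mask of the target region; temporal consistency then follows
# because every frame samples the same edited atlas.
import torch
from PIL import Image
from controlnet_aux import HEDdetector
from diffusers import ControlNetModel, StableDiffusionControlNetInpaintPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

# Off-the-shelf edge detector and edge-conditioned ControlNet (assumed checkpoints).
hed = HEDdetector.from_pretrained("lllyasviel/Annotators")
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-hed")
pipe = StableDiffusionControlNetInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",
    controlnet=controlnet,
).to(device)


def edit_atlas(atlas: Image.Image, target_mask: Image.Image, prompt: str) -> Image.Image:
    """Edit the object atlas once, inside the target mask, guided by its edge map."""
    edges = hed(atlas)  # structure guidance extracted from the atlas itself
    return pipe(
        prompt=prompt,
        image=atlas,             # atlas image to edit
        mask_image=target_mask,  # panoptic mask of the targeted region
        control_image=edges,     # edge map preserves the original layout
        num_inference_steps=30,
    ).images[0]


# Usage sketch: `atlas`, `mask`, and per-frame UV grids come from a layered
# neural atlas decomposition of the input video (precomputed beforehand).
# edited_atlas = edit_atlas(atlas, mask, "a swan made of origami paper")
# frames = [warp_atlas_to_frame(edited_atlas, uv) for uv in uv_grids]  # hypothetical helper
```

Because the diffusion model touches only the atlas, several edits of the same video can be produced by rerunning `edit_atlas` with different prompts while the (expensive) atlas decomposition is reused.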