

I2VEdit: First-Frame-Guided Video Editing via Image-to-Video Diffusion Models

May 26, 2024
Authors: Wenqi Ouyang, Yi Dong, Lei Yang, Jianlou Si, Xingang Pan
cs.AI

Abstract

The remarkable generative capabilities of diffusion models have motivated extensive research in both image and video editing. Compared to video editing which faces additional challenges in the time dimension, image editing has witnessed the development of more diverse, high-quality approaches and more capable software like Photoshop. In light of this gap, we introduce a novel and generic solution that extends the applicability of image editing tools to videos by propagating edits from a single frame to the entire video using a pre-trained image-to-video model. Our method, dubbed I2VEdit, adaptively preserves the visual and motion integrity of the source video depending on the extent of the edits, effectively handling global edits, local edits, and moderate shape changes, which existing methods cannot fully achieve. At the core of our method are two main processes: Coarse Motion Extraction to align basic motion patterns with the original video, and Appearance Refinement for precise adjustments using fine-grained attention matching. We also incorporate a skip-interval strategy to mitigate quality degradation from auto-regressive generation across multiple video clips. Experimental results demonstrate our framework's superior performance in fine-grained video editing, proving its capability to produce high-quality, temporally consistent outputs.
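For concreteness, the sketch below illustrates the pipeline structure this abstract describes: an edited first frame is propagated clip by clip through a pre-trained image-to-video model, combining coarse motion extraction, appearance refinement, and a skip-interval conditioning scheme. This is a hypothetical outline inferred from the abstract only; the function names (`extract_coarse_motion`, `image_to_video`, `refine_appearance`) and the particular reading of the skip-interval strategy are illustrative assumptions, not the authors' implementation or API.

```python
# Hypothetical sketch of first-frame-guided edit propagation, inferred from the
# abstract of I2VEdit. All callables are placeholders supplied by the caller.
from typing import Any, Callable, List, Sequence

Frame = Any           # one video frame (e.g., an H x W x 3 array)
Clip = List[Frame]    # a short sequence of frames


def propagate_edit(
    source_clips: Sequence[Clip],
    edited_first_frame: Frame,
    extract_coarse_motion: Callable[[Clip], Any],
    image_to_video: Callable[[Frame, Any], Clip],
    refine_appearance: Callable[[Clip, Clip], Clip],
    skip_interval: int = 2,
) -> List[Clip]:
    """Propagate a single edited frame through every clip of the source video.

    - extract_coarse_motion: aligns basic motion patterns with a source clip.
    - image_to_video: pre-trained image-to-video model conditioned on a frame
      plus a motion prior.
    - refine_appearance: fine-grained adjustment of the generated clip against
      the corresponding source clip (e.g., attention matching).
    - skip_interval: one plausible reading of the paper's strategy, where later
      clips condition on a frame from an earlier clip instead of always chaining
      to the immediately preceding one, limiting auto-regressive drift.
    """
    edited_clips: List[Clip] = []
    for i, src_clip in enumerate(source_clips):
        if i == 0:
            cond_frame = edited_first_frame
        else:
            # Reach back up to `skip_interval` clips for the conditioning frame.
            anchor = max(0, i - skip_interval)
            cond_frame = edited_clips[anchor][-1]

        motion = extract_coarse_motion(src_clip)           # coarse motion extraction
        coarse_clip = image_to_video(cond_frame, motion)   # propagate the edit
        edited_clips.append(refine_appearance(coarse_clip, src_clip))  # appearance refinement
    return edited_clips
```

In this reading, the quality-degradation problem of strictly chained auto-regressive generation is mitigated because errors accumulate over fewer conditioning hops; the actual mechanism in the paper may differ in detail.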

