ControlVideo: Adding Conditional Control for One Shot Text-to-Video Editing

Abstract

In this paper, we present ControlVideo, a novel method for text-driven video editing. Leveraging the capabilities of text-to-image diffusion models and ControlNet, ControlVideo aims to enhance the fidelity and temporal consistency of videos that align with a given text while preserving the structure of the source video. This is achieved by incorporating additional conditions such as edge maps and by fine-tuning the key-frame and temporal attention on the source video-text pair with carefully designed strategies. An in-depth exploration of ControlVideo's design is conducted to inform future research on one-shot tuning of video diffusion models. Quantitatively, ControlVideo outperforms a range of competitive baselines in terms of faithfulness and consistency while still aligning with the textual prompt. Additionally, it delivers videos with high visual realism and fidelity with respect to the source content, demonstrating flexibility in utilizing controls that contain varying degrees of source video information, as well as the potential for combining multiple controls. The project page is available at https://ml.cs.tsinghua.edu.cn/controlvideo/.
