ControlVideo: Adding Conditional Control for One Shot Text-to-Video Editing

Abstract

In this paper, we present ControlVideo, a novel method for text-driven video editing. Building on text-to-image diffusion models and ControlNet, ControlVideo aims to improve the fidelity and temporal consistency of videos that align with a given text while preserving the structure of the source video. It does so by incorporating additional conditions such as edge maps and by fine-tuning the key-frame and temporal attention on the source video-text pair with carefully designed strategies. An in-depth exploration of ControlVideo's design is conducted to inform future research on one-shot tuning of video diffusion models. Quantitatively, ControlVideo outperforms a range of competitive baselines in faithfulness and consistency while still aligning with the textual prompt. It also delivers videos with high visual realism and fidelity to the source content, demonstrating flexibility in using controls that contain varying degrees of source video information, as well as the potential for combining multiple controls. The project page is available at https://ml.cs.tsinghua.edu.cn/controlvideo/.
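To make the idea of an edge-map condition concrete, the following is a minimal sketch of how one edge map per source frame might be computed. The abstract does not say which edge detector is used, so a plain Sobel gradient with a fixed threshold stands in here as an assumption; the names `edge_map`, `video_edge_maps`, and `threshold` are hypothetical and not part of ControlVideo.

```python
from typing import List

Frame = List[List[float]]  # grayscale image as rows of pixel intensities


def edge_map(frame: Frame, threshold: float = 1.0) -> List[List[int]]:
    """Binary edge map from the Sobel gradient magnitude (assumed detector).

    Border pixels are left at 0 because the 3x3 kernel does not fit there.
    """
    h, w = len(frame), len(frame[0])
    edges = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Horizontal gradient: right column minus left column.
            gx = (frame[y - 1][x + 1] + 2 * frame[y][x + 1] + frame[y + 1][x + 1]
                  - frame[y - 1][x - 1] - 2 * frame[y][x - 1] - frame[y + 1][x - 1])
            # Vertical gradient: bottom row minus top row.
            gy = (frame[y + 1][x - 1] + 2 * frame[y + 1][x] + frame[y + 1][x + 1]
                  - frame[y - 1][x - 1] - 2 * frame[y - 1][x] - frame[y - 1][x + 1])
            if (gx * gx + gy * gy) ** 0.5 >= threshold:
                edges[y][x] = 1
    return edges


def video_edge_maps(frames: List[Frame], threshold: float = 1.0) -> List[List[List[int]]]:
    """One edge map per frame, i.e. a per-frame structural control signal."""
    return [edge_map(f, threshold) for f in frames]


# Toy two-frame "video": a vertical step edge, then a featureless frame.
step = [[0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(5)]
flat = [[0.5] * 5 for _ in range(5)]
maps = video_edge_maps([step, flat])
```

In this toy input, the first frame yields edge pixels along the brightness step and the second frame yields none, which illustrates how such a control carries structure from the source video while discarding appearance.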
