ControlVideo: Adding Conditional Control for One Shot Text-to-Video Editing

Abstract

In this paper, we present ControlVideo, a novel method for text-driven video editing. Leveraging the capabilities of text-to-image diffusion models and ControlNet, ControlVideo aims to enhance the fidelity and temporal consistency of videos that align with a given text while preserving the structure of the source video. This is achieved by incorporating additional conditions such as edge maps and by fine-tuning the key-frame and temporal attention on the source video-text pair with carefully designed strategies. An in-depth exploration of ControlVideo's design is conducted to inform future research on one-shot tuning of video diffusion models. Quantitatively, ControlVideo outperforms a range of competitive baselines in terms of faithfulness and consistency while still aligning with the textual prompt. Additionally, it delivers videos with high visual realism and fidelity with respect to the source content, demonstrating flexibility in utilizing controls that contain varying degrees of source video information, as well as the potential for combining multiple controls. The project page is available at https://ml.cs.tsinghua.edu.cn/controlvideo/.
