Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models

Abstract

This paper presents Video-ControlNet, a controllable text-to-video (T2V) diffusion model that generates videos conditioned on a sequence of control signals, such as edge or depth maps. Video-ControlNet is built on a pre-trained conditional text-to-image (T2I) diffusion model, extended with a spatial-temporal self-attention mechanism and trainable temporal layers for efficient cross-frame modeling. A first-frame conditioning strategy enables the model to generate videos transferred from the image domain, as well as arbitrary-length videos in an auto-regressive manner. Moreover, Video-ControlNet employs a novel residual-based noise initialization strategy that introduces a motion prior from an input video, producing more coherent videos. With the proposed architecture and strategies, Video-ControlNet achieves resource-efficient convergence and generates high-quality, consistent videos with fine-grained control. Extensive experiments on video generation tasks such as video editing and video style transfer show that it outperforms previous methods in consistency and quality. Project page: https://controlavideo.github.io/
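The abstract names a residual-based noise initialization strategy but does not give its formula. The following is a minimal sketch of one plausible reading, not the authors' implementation: the initial noise of each frame is copied from the previous frame wherever the input video barely changes, and resampled wherever the frame-to-frame residual is large, so that the noise itself carries a motion prior. The function name `residual_noise_init`, the `threshold` value, and the per-pixel copy-or-resample rule are all assumptions made for illustration.

```python
import numpy as np

def residual_noise_init(frames, threshold=0.1, seed=0):
    """Hypothetical sketch: build per-frame initial noise from frame residuals.

    frames: float array of shape (T, H, W, C), values assumed to lie in [0, 1].
    threshold: assumed cutoff on the absolute residual that marks a pixel as moving.
    Returns a noise array with the same shape as `frames`.
    """
    rng = np.random.default_rng(seed)
    frames = np.asarray(frames, dtype=np.float32)
    noise = np.empty_like(frames)

    # The first frame has no predecessor, so it gets plain Gaussian noise.
    noise[0] = rng.standard_normal(frames[0].shape)

    for t in range(1, len(frames)):
        # Residual between consecutive input frames, reduced over channels -> (H, W, 1).
        residual = np.abs(frames[t] - frames[t - 1]).max(axis=-1, keepdims=True)
        moved = residual > threshold

        # Moving pixels get fresh noise; static pixels reuse the previous frame's noise.
        fresh = rng.standard_normal(frames[t].shape)
        noise[t] = np.where(moved, fresh, noise[t - 1])

    return noise

# Toy input: the top half of the image changes between frames 0 and 1, then stays fixed.
frames = np.zeros((3, 4, 4, 3), dtype=np.float32)
frames[1:, :2] = 1.0
noise = residual_noise_init(frames)
```

In this toy input, the static bottom half of frame 1 shares its noise with frame 0, and frame 2, which is identical to frame 1, reuses frame 1's noise entirely. Under this reading, temporal coherence in the input video translates directly into correlated starting noise for the denoising process.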
