ControlVideo: Training-free Controllable Text-to-Video Generation

Abstract

Text-driven diffusion models have unlocked unprecedented abilities in image generation, whereas their video counterparts still lag behind due to the excessive training cost of temporal modeling. Besides the training burden, generated videos also suffer from appearance inconsistency and structural flicker, especially in long video synthesis. To address these challenges, we design a training-free framework called ControlVideo to enable natural and efficient text-to-video generation. ControlVideo, adapted from ControlNet, leverages the coarse structural consistency of input motion sequences and introduces three modules to improve video generation. First, to ensure appearance coherence between frames, ControlVideo adds full cross-frame interaction to the self-attention modules. Second, to mitigate the flicker effect, it introduces an interleaved-frame smoother that applies frame interpolation to alternating frames. Finally, to produce long videos efficiently, it uses a hierarchical sampler that synthesizes each short clip separately while maintaining holistic coherence. Equipped with these modules, ControlVideo outperforms the state of the art on extensive motion-prompt pairs, both quantitatively and qualitatively. Notably, thanks to its efficient design, it generates both short and long videos within several minutes on a single NVIDIA 2080Ti. Code is available at https://github.com/YBYBZhang/ControlVideo.
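
The first module, cross-frame interaction in self-attention, can be illustrated with a small NumPy sketch. This is not the authors' implementation (see the repository linked above for that). The function names, tensor shapes, and the single-head formulation without learned projections are assumptions made for illustration. The idea shown is that each frame's queries attend to keys and values pooled from all frames, rather than only from their own frame.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def per_frame_attention(q, k, v):
    """Baseline: each frame attends only to its own tokens.

    q, k, v: arrays of shape (frames, tokens, dim).
    """
    dim = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(dim)  # (F, N, N)
    return softmax(scores) @ v  # (F, N, D)


def full_cross_frame_attention(q, k, v):
    """Illustrative variant: every frame attends to tokens of all frames."""
    frames, tokens, dim = k.shape
    # Pool keys and values from all frames into one shared set.
    k_all = k.reshape(frames * tokens, dim)
    v_all = v.reshape(frames * tokens, dim)
    scores = q @ k_all.T / np.sqrt(dim)  # (F, N, F*N)
    return softmax(scores) @ v_all  # (F, N, D)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.normal(size=(4, 6, 8))  # 4 frames, 6 tokens each, 8 channels
    k = rng.normal(size=(4, 6, 8))
    v = rng.normal(size=(4, 6, 8))
    out = full_cross_frame_attention(q, k, v)
    print(out.shape)  # (4, 6, 8)
```

Because the keys and values are shared across frames in this sketch, two frames with identical queries receive identical outputs, which gives an intuition for why such sharing encourages a consistent appearance. The cost is that the attention matrix grows with the number of frames, which is one reason long videos are typically handled clip by clip.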
