VideoTetris: Towards Compositional Text-to-Video Generation

Abstract

Diffusion models have demonstrated great success in text-to-video (T2V) generation. However, existing methods may struggle with complex (long) video generation scenarios that involve multiple objects or dynamic changes in the number of objects. To address these limitations, we propose VideoTetris, a novel framework that enables compositional T2V generation. Specifically, we propose spatio-temporal compositional diffusion, which precisely follows complex textual semantics by manipulating and composing the attention maps of denoising networks spatially and temporally. Moreover, we propose enhanced video data preprocessing that strengthens the training data with respect to motion dynamics and prompt understanding, together with a new reference frame attention mechanism that improves the consistency of auto-regressive video generation. Extensive experiments demonstrate that VideoTetris achieves impressive qualitative and quantitative results in compositional T2V generation. Code is available at: https://github.com/YangLing0818/VideoTetris
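
To make the idea of composing attention spatially and temporally more concrete, here is a minimal toy sketch in plain Python. It is not the VideoTetris implementation: the function names, the `(start, end, sub_prompt, box)` schedule format, and the averaging rule for overlapping regions are all assumptions made for illustration, and each grid cell holds a single scalar where a real denoising network would hold a feature vector.

```python
# Hypothetical sketch: names and the layout format are illustrative only,
# not taken from the VideoTetris codebase.

def box_mask(height, width, box):
    """Binary mask for box = (top, left, bottom, right), end-exclusive."""
    top, left, bottom, right = box
    return [[1.0 if top <= y < bottom and left <= x < right else 0.0
             for x in range(width)] for y in range(height)]

def compose_frame(global_out, region_outs, masks):
    """Blend per-sub-prompt attention outputs inside their regions.

    Positions covered by no region keep the global-prompt output;
    overlapping regions are averaged (an assumed blending rule).
    """
    height, width = len(global_out), len(global_out[0])
    composed = [row[:] for row in global_out]
    for y in range(height):
        for x in range(width):
            weight = sum(m[y][x] for m in masks)
            if weight > 0:
                composed[y][x] = sum(
                    m[y][x] * out[y][x] for m, out in zip(masks, region_outs)
                ) / weight
    return composed

def active_regions(schedule, frame_idx):
    """Return the (sub_prompt, box) pairs whose frame range contains frame_idx."""
    return [(prompt, box) for start, end, prompt, box in schedule
            if start <= frame_idx < end]

# Temporal layout: the cat is present for frames 0-7, the dog joins at frame 4.
schedule = [
    (0, 8, "a cat", (0, 0, 2, 2)),
    (4, 8, "a dog", (0, 2, 2, 4)),
]
```

In this toy setup, `active_regions(schedule, frame_idx)` selects which sub-prompts apply to a frame (the temporal part), and `compose_frame` merges their outputs by region (the spatial part), so the number of composed objects can change over the course of the video.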
