Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance

Abstract

Creating a vivid video from an event or scenario in our imagination is a truly fascinating experience. Recent advances in text-to-video synthesis have shown the potential to achieve this with prompts alone. While text is convenient for conveying the overall scene context, it may be insufficient for precise control. In this paper, we explore customized video generation that uses text as the context description and motion structure (e.g., frame-wise depth) as concrete guidance. Our method, dubbed Make-Your-Video, performs joint-conditional video generation using a Latent Diffusion Model that is pre-trained for still image synthesis and then promoted to video generation through the introduction of temporal modules. This two-stage learning scheme not only reduces the computing resources required, but also improves performance by transferring the rich concepts available in image datasets into video generation. Moreover, we use a simple yet effective causal attention mask strategy to enable longer video synthesis, which effectively mitigates the potential quality degradation. Experimental results show the superiority of our method over existing baselines, particularly in terms of temporal coherence and fidelity to users' guidance. In addition, our model enables several intriguing applications that demonstrate its potential for practical use.
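
The abstract names a causal attention mask strategy but gives no implementation details. As a rough illustration only, the sketch below shows a generic causal mask applied to single-head attention along the frame axis, so that each frame attends only to itself and earlier frames. The function names (`causal_mask`, `masked_softmax`, `causal_temporal_attention`) and the single-head, pure-Python formulation are assumptions made for this example, not the authors' code.

```python
import math


def causal_mask(num_frames):
    """Boolean matrix: entry [i][j] is True when frame i may attend to frame j."""
    return [[j <= i for j in range(num_frames)] for i in range(num_frames)]


def masked_softmax(scores, mask):
    """Row-wise softmax that assigns zero weight wherever mask is False."""
    weights = []
    for row_scores, row_mask in zip(scores, mask):
        allowed = [s for s, m in zip(row_scores, row_mask) if m]
        peak = max(allowed)  # subtract the row maximum for numerical stability
        exps = [math.exp(s - peak) if m else 0.0
                for s, m in zip(row_scores, row_mask)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


def causal_temporal_attention(queries, keys, values):
    """Hypothetical single-head attention over frames with a causal mask.

    queries, keys, values: one feature vector (list of floats) per frame.
    Returns the attended features and the attention weights.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    # Scaled dot-product score between every pair of frames.
    scores = [[scale * sum(q * k for q, k in zip(qv, kv)) for kv in keys]
              for qv in queries]
    weights = masked_softmax(scores, causal_mask(len(queries)))
    out_dim = len(values[0])
    out = [[sum(w * v[d] for w, v in zip(row, values)) for d in range(out_dim)]
           for row in weights]
    return out, weights


if __name__ == "__main__":
    frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    attended, attn = causal_temporal_attention(frames, frames, frames)
    for row in attn:
        print([round(w, 3) for w in row])
```

Because every frame depends only on frames up to its own index, a mask of this kind does not tie the computation to one fixed clip length, which is one plausible reason such a strategy would help with longer sequences. How the authors actually apply the mask is not stated in the abstract.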
