
Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance

June 1, 2023
Authors: Jinbo Xing, Menghan Xia, Yuxin Liu, Yuechen Zhang, Yong Zhang, Yingqing He, Hanyuan Liu, Haoxin Chen, Xiaodong Cun, Xintao Wang, Ying Shan, Tien-Tsin Wong
cs.AI

Abstract

Creating a vivid video from the event or scenario in our imagination is a truly fascinating experience. Recent advancements in text-to-video synthesis have unveiled the potential to achieve this with prompts only. While text is convenient in conveying the overall scene context, it may be insufficient for precise control. In this paper, we explore customized video generation by utilizing text as context description and motion structure (e.g., frame-wise depth) as concrete guidance. Our method, dubbed Make-Your-Video, involves joint-conditional video generation using a Latent Diffusion Model that is pre-trained for still image synthesis and then promoted for video generation with the introduction of temporal modules. This two-stage learning scheme not only reduces the computing resources required, but also improves performance by transferring the rich concepts available solely in image datasets into video generation. Moreover, we use a simple yet effective causal attention mask strategy to enable longer video synthesis, which effectively mitigates potential quality degradation. Experimental results show the superiority of our method over existing baselines, particularly in terms of temporal coherence and fidelity to users' guidance. In addition, our model enables several intriguing applications that demonstrate its potential for practical usage.
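
To make the causal attention mask strategy more concrete, below is a minimal sketch (not the authors' released implementation) of temporal self-attention in which each frame attends only to itself and preceding frames. The function name, tensor shapes, and the single-head formulation are simplifying assumptions for illustration only.

```python
# Minimal sketch of causal temporal attention (illustrative; shapes and naming
# are assumptions, not the paper's actual code).
import torch


def causal_temporal_attention(q, k, v):
    """q, k, v: (batch, num_frames, dim) per-frame feature vectors (hypothetical layout)."""
    num_frames = q.shape[1]
    scale = q.shape[-1] ** -0.5
    # Pairwise frame-to-frame attention scores: (batch, num_frames, num_frames).
    scores = torch.einsum("bqd,bkd->bqk", q, k) * scale
    # Causal mask: frame t may only attend to frames <= t (future frames are blocked).
    mask = torch.triu(
        torch.ones(num_frames, num_frames, dtype=torch.bool, device=scores.device),
        diagonal=1,
    )
    scores = scores.masked_fill(mask, float("-inf"))
    attn = scores.softmax(dim=-1)
    return torch.einsum("bqk,bkd->bqd", attn, v)
```

Restricting each frame to past context in this way is the kind of mechanism that allows a clip to be extended beyond the training length while limiting error accumulation, in the spirit of the longer-video synthesis described in the abstract.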