Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance
June 1, 2023
Authors: Jinbo Xing, Menghan Xia, Yuxin Liu, Yuechen Zhang, Yong Zhang, Yingqing He, Hanyuan Liu, Haoxin Chen, Xiaodong Cun, Xintao Wang, Ying Shan, Tien-Tsin Wong
cs.AI
Abstract
Creating a vivid video from an event or scenario in our imagination is a
truly fascinating experience. Recent advances in text-to-video synthesis
have unveiled the potential to achieve this with prompts alone. While text is
convenient for conveying the overall scene context, it may be insufficient
for precise control. In this paper, we explore customized video generation by
utilizing text as the context description and motion structure (e.g.,
frame-wise depth) as concrete guidance. Our method, dubbed Make-Your-Video,
performs joint-conditional video generation with a Latent Diffusion Model
that is pre-trained for still image synthesis and then extended to video
generation through the introduction of temporal modules. This two-stage
learning scheme not only reduces the computing resources required but also
improves performance by transferring the rich concepts available only in
image datasets into video generation. Moreover, we employ a simple yet
effective causal attention mask strategy to enable longer video synthesis,
which effectively mitigates potential quality degradation. Experimental
results show the superiority of our method over existing baselines,
particularly in terms of temporal coherence and fidelity to user guidance.
In addition, our model enables several intriguing applications that
demonstrate its potential for practical use.
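
The two-stage scheme can be pictured as wrapping each frozen spatial block of the pre-trained image LDM with a trainable temporal attention layer. The sketch below is a minimal, hypothetical PyTorch illustration, not the authors' implementation; the class names, tensor layout, zero-initialized output projection, and the assumption that the spatial block preserves channel count and resolution are all ours.

```python
# A minimal sketch of the two-stage design described in the abstract:
# spatial layers come from a pre-trained image LDM and stay frozen, while
# newly inserted temporal attention layers are trained on video data.
from typing import Optional

import torch
import torch.nn as nn


class TemporalAttention(nn.Module):
    """Self-attention over the frame axis, added on top of a frozen spatial block."""

    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        # Zero-init the output projection so the module acts as an identity at
        # the start of stage 2 and the image prior is preserved (an assumed,
        # commonly used trick; the paper may differ).
        nn.init.zeros_(self.attn.out_proj.weight)
        nn.init.zeros_(self.attn.out_proj.bias)

    def forward(self, x: torch.Tensor, attn_mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        # x: (batch * height * width, frames, channels)
        h = self.norm(x)
        h, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        return x + h


class VideoBlock(nn.Module):
    """Frozen per-frame spatial block plus a trainable temporal layer."""

    def __init__(self, spatial_block: nn.Module, channels: int):
        super().__init__()
        self.spatial = spatial_block
        for p in self.spatial.parameters():
            p.requires_grad_(False)  # stage-1 image weights stay fixed
        self.temporal = TemporalAttention(channels)  # trained in stage 2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels, height, width); the spatial block is
        # assumed to preserve channels and resolution.
        b, t, c, h, w = x.shape
        x = self.spatial(x.reshape(b * t, c, h, w)).reshape(b, t, c, h, w)
        # Fold spatial positions into the batch and attend across frames.
        x = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, t, c)
        x = self.temporal(x)
        return x.reshape(b, h, w, t, c).permute(0, 3, 4, 1, 2)
```

Freezing the spatial weights is what lets the rich image-only concepts carry over: only the temporal layers see video data, so training cost stays low.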
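
The causal attention mask mentioned in the abstract restricts temporal attention so that each frame attends only to itself and earlier frames, which helps the model roll forward over more frames than it was trained on. Below is one plausible construction of such a mask; the optional attention window is our illustrative assumption, not a detail confirmed by the paper.

```python
# A hedged sketch of a causal temporal attention mask: frame i may attend
# to frames j <= i, optionally limited to the most recent `window` frames.
from typing import Optional

import torch


def causal_frame_mask(num_frames: int, window: Optional[int] = None) -> torch.Tensor:
    """Boolean mask for temporal attention; True marks a blocked position."""
    i = torch.arange(num_frames).unsqueeze(1)  # query frame index (column)
    j = torch.arange(num_frames).unsqueeze(0)  # key frame index (row)
    blocked = j > i                            # never attend to future frames
    if window is not None:
        blocked |= j < i - window + 1          # drop frames beyond the window
    return blocked
```

With `torch.nn.MultiheadAttention`, a boolean `attn_mask` entry of `True` means that position is not allowed to attend, so `causal_frame_mask(t)` can be passed directly as the `attn_mask` of the `TemporalAttention` sketch above.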