Moonshot: Towards Controllable Video Generation and Editing with Multimodal Conditions
January 3, 2024
作者: David Junhao Zhang, Dongxu Li, Hung Le, Mike Zheng Shou, Caiming Xiong, Doyen Sahoo
cs.AI
Abstract
Most existing video diffusion models (VDMs) are limited to text-only conditions
and therefore usually lack control over the visual appearance and geometric
structure of the generated videos. This work presents Moonshot, a new video
generation model that conditions simultaneously on multimodal inputs of image
and text. The model is built upon a core module, called the multimodal video
block (MVB), which consists of conventional spatiotemporal layers for
representing video features and a decoupled cross-attention layer that attends
to image and text inputs for appearance conditioning. In addition, we carefully
design the model architecture so that it can optionally integrate pre-trained
image ControlNet modules for geometric visual conditions, without the extra
training overhead required by prior methods. Experiments show that, with its
versatile multimodal conditioning mechanism, Moonshot achieves significant
improvements in visual quality and temporal consistency over existing models.
Moreover, the model can be easily repurposed for a variety of generative
applications, such as personalized video generation, image animation, and video
editing, revealing its potential to serve as a fundamental architecture for
controllable video generation. Models will be made public at
https://github.com/salesforce/LAVIS.
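The decoupled cross-attention described in the abstract can be illustrated with a minimal PyTorch sketch. This is a hypothetical reconstruction, not the paper's released implementation: it assumes (in the style of IP-Adapter-like designs) that text and image conditions get separate key/value attention branches whose outputs are summed into the video features, with a scale on the image branch. The class name, dimensions, and `image_scale` parameter are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DecoupledCrossAttention(nn.Module):
    """Hypothetical sketch of an MVB-style decoupled cross-attention layer:
    two attention branches with separate key/value projections, one attending
    to text embeddings and one to image embeddings, summed residually."""

    def __init__(self, dim: int, cond_dim: int, num_heads: int = 8):
        super().__init__()
        # kdim/vdim let keys and values come from the condition space,
        # while queries stay in the video feature space.
        self.attn_text = nn.MultiheadAttention(
            dim, num_heads, kdim=cond_dim, vdim=cond_dim, batch_first=True)
        self.attn_image = nn.MultiheadAttention(
            dim, num_heads, kdim=cond_dim, vdim=cond_dim, batch_first=True)

    def forward(self, video_tokens, text_emb, image_emb, image_scale=1.0):
        # video_tokens: (B, N, dim) flattened spatiotemporal features
        # text_emb:     (B, T, cond_dim) text condition tokens
        # image_emb:    (B, I, cond_dim) image condition tokens
        out_text, _ = self.attn_text(video_tokens, text_emb, text_emb)
        out_image, _ = self.attn_image(video_tokens, image_emb, image_emb)
        return video_tokens + out_text + image_scale * out_image
```

Setting `image_scale` to 0 recovers text-only conditioning, which is one way such a layer can remain compatible with purely text-conditioned generation.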