Moonshot: Towards Controllable Video Generation and Editing with Multimodal Conditions
January 3, 2024
作者: David Junhao Zhang, Dongxu Li, Hung Le, Mike Zheng Shou, Caiming Xiong, Doyen Sahoo
cs.AI
Abstract
Most existing video diffusion models (VDMs) are limited to text-only conditions
and therefore usually lack control over the visual appearance and geometric
structure of the generated videos. This work presents Moonshot, a new video
generation model that conditions simultaneously on multimodal inputs of image
and text. The model is built upon a core module, called the multimodal video
block (MVB), which consists of conventional spatiotemporal layers for
representing video features and a decoupled cross-attention layer that attends
to image and text inputs for appearance conditioning. In addition, we carefully
design the model architecture so that it can optionally integrate pre-trained
image ControlNet modules for geometric visual conditions, without the extra
training overhead required by prior methods. Experiments show that, with its
versatile multimodal conditioning mechanism, Moonshot achieves significant
improvements in visual quality and temporal consistency over existing models.
Moreover, the model can be easily repurposed for a variety of generative
applications, such as personalized video generation, image animation, and video
editing, revealing its potential to serve as a fundamental architecture for
controllable video generation. Models will be made public at
https://github.com/salesforce/LAVIS.
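The decoupled cross-attention described in the abstract can be illustrated with a minimal PyTorch sketch. This is a hypothetical reconstruction, not the paper's released implementation: it assumes (in the style of IP-Adapter-like designs) that text and image conditions get separate key/value attention branches whose outputs are summed into the video features, with a scale on the image branch. The class name, dimensions, and `image_scale` parameter are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DecoupledCrossAttention(nn.Module):
    """Hypothetical sketch of an MVB-style decoupled cross-attention layer:
    two attention branches with separate key/value projections, one attending
    to text embeddings and one to image embeddings, summed residually."""

    def __init__(self, dim: int, cond_dim: int, num_heads: int = 8):
        super().__init__()
        # kdim/vdim let keys and values come from the condition space,
        # while queries stay in the video feature space.
        self.attn_text = nn.MultiheadAttention(
            dim, num_heads, kdim=cond_dim, vdim=cond_dim, batch_first=True)
        self.attn_image = nn.MultiheadAttention(
            dim, num_heads, kdim=cond_dim, vdim=cond_dim, batch_first=True)

    def forward(self, video_tokens, text_emb, image_emb, image_scale=1.0):
        # video_tokens: (B, N, dim) flattened spatiotemporal features
        # text_emb:     (B, T, cond_dim) text condition tokens
        # image_emb:    (B, I, cond_dim) image condition tokens
        out_text, _ = self.attn_text(video_tokens, text_emb, text_emb)
        out_image, _ = self.attn_image(video_tokens, image_emb, image_emb)
        return video_tokens + out_text + image_scale * out_image
```

Setting `image_scale` to 0 recovers text-only conditioning, which is one way such a layer can remain compatible with purely text-conditioned generation.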