VIMI: Grounding Video Generation through Multi-modal Instruction
July 8, 2024
Authors: Yuwei Fang, Willi Menapace, Aliaksandr Siarohin, Tsai-Shien Chen, Kuan-Chieh Wang, Ivan Skorokhodov, Graham Neubig, Sergey Tulyakov
cs.AI
Abstract
Existing text-to-video diffusion models rely solely on text encoders for their pretraining. This limitation stems from the absence of large-scale multimodal prompt video datasets, resulting in a lack of visual grounding and restricting their versatility and application in multimodal integration. To address this, we construct a large-scale multimodal prompt dataset by employing retrieval methods to pair in-context examples with the given text prompts, and then utilize a two-stage training strategy to enable diverse video generation tasks within the same model. In the first stage, we propose a multimodal conditional video generation framework for pretraining on these augmented datasets, establishing a foundational model for grounded video generation. In the second stage, we finetune the model from the first stage on three video generation tasks, incorporating multimodal instructions. This process further refines the model's ability to handle diverse inputs and tasks, ensuring seamless integration of multimodal information. After this two-stage training process, VIMI demonstrates multimodal understanding capabilities, producing contextually rich and personalized videos grounded in the provided inputs, as shown in Figure 1. Compared to previous visually grounded video generation methods, VIMI can synthesize consistent and temporally coherent videos with large motion while retaining semantic control. Lastly, VIMI also achieves state-of-the-art text-to-video generation results on the UCF101 benchmark.
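To make the retrieval step in the abstract concrete, below is a minimal sketch (not the authors' released code) of how text prompts could be paired with in-context examples via nearest-neighbor search in a shared embedding space. The function name `retrieve_in_context_examples` and the use of precomputed, L2-normalized CLIP-style embeddings are assumptions for illustration only.

```python
# Minimal sketch, assuming precomputed, L2-normalized CLIP-style embeddings:
# `prompt_embs` is (N, d) for N text prompts, `example_embs` is (M, d) for
# M candidate in-context examples. Not the authors' implementation.
import numpy as np

def retrieve_in_context_examples(prompt_embs: np.ndarray,
                                 example_embs: np.ndarray,
                                 k: int = 3) -> np.ndarray:
    """Return the indices of the top-k in-context examples per prompt."""
    # Cosine similarity reduces to a dot product for normalized vectors.
    sims = prompt_embs @ example_embs.T           # (N, M)
    # Sort each row in descending similarity and keep the k best matches.
    return np.argsort(-sims, axis=1)[:, :k]       # (N, k)

# Example: 4 prompts, 10 candidate examples, 512-dim embeddings.
rng = np.random.default_rng(0)
p = rng.normal(size=(4, 512))
p /= np.linalg.norm(p, axis=1, keepdims=True)
e = rng.normal(size=(10, 512))
e /= np.linalg.norm(e, axis=1, keepdims=True)
pairs = retrieve_in_context_examples(p, e, k=3)
print(pairs.shape)  # (4, 3): each prompt paired with 3 retrieved examples
```

In practice, the retrieved examples would then be attached to the prompt as multimodal conditioning for the two-stage training described above; the exact retrieval index and embedding model used by the paper are not specified in this abstract.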