Lumiere: A Space-Time Diffusion Model for Video Generation
January 23, 2024
Authors: Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Yuanzhen Li, Tomer Michaeli, Oliver Wang, Deqing Sun, Tali Dekel, Inbar Mosseri
cs.AI
Abstract
We introduce Lumiere -- a text-to-video diffusion model designed for
synthesizing videos that portray realistic, diverse and coherent motion -- a
pivotal challenge in video synthesis. To this end, we introduce a Space-Time
U-Net architecture that generates the entire temporal duration of the video at
once, through a single pass in the model. This is in contrast to existing video
models which synthesize distant keyframes followed by temporal super-resolution
-- an approach that inherently makes global temporal consistency difficult to
achieve. By deploying both spatial and (importantly) temporal down- and
up-sampling and leveraging a pre-trained text-to-image diffusion model, our
model learns to directly generate a full-frame-rate, low-resolution video by
processing it in multiple space-time scales. We demonstrate state-of-the-art
text-to-video generation results, and show that our design easily facilitates a
wide range of content creation tasks and video editing applications, including
image-to-video, video inpainting, and stylized generation.
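The central architectural idea described above, processing the entire clip in a single pass by downsampling and upsampling the video jointly in space and time, can be illustrated with a small sketch. The PyTorch-style code below is a hypothetical illustration only, not the authors' implementation: the module names, kernel sizes, and tensor shapes are assumptions chosen to show how a factorized space-time block can reduce and then restore temporal and spatial resolution together.

# Minimal sketch (assumed, not the authors' code) of joint space-time
# down/up-sampling, the key ingredient of a Space-Time U-Net.
import torch
import torch.nn as nn

class SpaceTimeDownBlock(nn.Module):
    """Jointly reduces the spatial and temporal resolution of a video tensor."""
    def __init__(self, channels: int):
        super().__init__()
        # Factorized space-time convolution: spatial 1x3x3 followed by temporal 3x1x1.
        self.spatial = nn.Conv3d(channels, channels, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.temporal = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1), padding=(1, 0, 0))
        # Strided pooling halves time, height, and width together.
        self.pool = nn.AvgPool3d(kernel_size=2, stride=2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, channels, frames, height, width).
        x = torch.relu(self.spatial(x))
        x = torch.relu(self.temporal(x))
        return self.pool(x)

class SpaceTimeUpBlock(nn.Module):
    """Restores the space-time resolution removed by a down block."""
    def __init__(self, channels: int):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="trilinear", align_corners=False)
        self.conv = nn.Conv3d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.conv(self.up(x)))

if __name__ == "__main__":
    # A 16-frame, 64x64 clip: the whole temporal duration passes through at once.
    video = torch.randn(1, 8, 16, 64, 64)
    down, up = SpaceTimeDownBlock(8), SpaceTimeUpBlock(8)
    coarse = down(video)      # (1, 8, 8, 32, 32): coarser in both time and space
    restored = up(coarse)     # (1, 8, 16, 64, 64)
    print(coarse.shape, restored.shape)

Because the network coarsens the clip along the time axis as well as the spatial axes, every frame is generated in the same forward pass, which is what distinguishes this design from keyframe-then-temporal-super-resolution pipelines.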