

Lumiere: A Space-Time Diffusion Model for Video Generation

January 23, 2024
Authors: Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Yuanzhen Li, Tomer Michaeli, Oliver Wang, Deqing Sun, Tali Dekel, Inbar Mosseri
cs.AI

Abstract

We introduce Lumiere -- a text-to-video diffusion model designed for synthesizing videos that portray realistic, diverse and coherent motion -- a pivotal challenge in video synthesis. To this end, we introduce a Space-Time U-Net architecture that generates the entire temporal duration of the video at once, through a single pass in the model. This is in contrast to existing video models which synthesize distant keyframes followed by temporal super-resolution -- an approach that inherently makes global temporal consistency difficult to achieve. By deploying both spatial and (importantly) temporal down- and up-sampling and leveraging a pre-trained text-to-image diffusion model, our model learns to directly generate a full-frame-rate, low-resolution video by processing it in multiple space-time scales. We demonstrate state-of-the-art text-to-video generation results, and show that our design easily facilitates a wide range of content creation tasks and video editing applications, including image-to-video, video inpainting, and stylized generation.
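To make the single-pass space-time processing concrete, below is a minimal PyTorch sketch (not the authors' implementation) of an encoder-decoder that downsamples and upsamples a video clip in both space and time, so the whole clip's duration is handled in one forward pass. The diffusion process, text conditioning, and the pre-trained text-to-image backbone are omitted, and all names such as `STUNetSketch` and `SpaceTimeBlock` are hypothetical.

```python
# Minimal sketch of the space-time down/up-sampling idea: the clip is compressed
# in BOTH the temporal and spatial axes, processed at the coarsest scale, and
# expanded back, all within a single forward pass. Illustrative only.
import torch
import torch.nn as nn


class SpaceTimeBlock(nn.Module):
    """A 3D conv block that mixes information across time and space."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1)  # (C, T, H, W)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.conv(x))


class STUNetSketch(nn.Module):
    """Encoder-decoder that halves T, H, and W at every level, then restores them."""

    def __init__(self, channels=(16, 32, 64)):
        super().__init__()
        chs = (3,) + tuple(channels)
        self.encoders = nn.ModuleList(
            SpaceTimeBlock(chs[i], chs[i + 1]) for i in range(len(channels))
        )
        self.down = nn.MaxPool3d(kernel_size=2)  # temporal AND spatial downsampling
        self.up = nn.Upsample(scale_factor=2, mode="trilinear", align_corners=False)
        self.decoders = nn.ModuleList(
            SpaceTimeBlock(chs[i + 1] * 2, chs[i]) for i in reversed(range(len(channels)))
        )
        self.out = nn.Conv3d(3, 3, kernel_size=1)

    def forward(self, x):
        # x: (batch, 3, T, H, W) -- the entire clip is processed at once.
        skips = []
        for enc in self.encoders:
            x = enc(x)
            skips.append(x)
            x = self.down(x)  # compress time as well as space
        for dec in self.decoders:
            x = self.up(x)  # expand back toward full frame rate and resolution
            x = dec(torch.cat([x, skips.pop()], dim=1))
        return self.out(x)


# Example: a 16-frame, 64x64 clip passes through the network in one forward call.
video = torch.randn(1, 3, 16, 64, 64)
print(STUNetSketch()(video).shape)  # torch.Size([1, 3, 16, 64, 64])
```

The point of the sketch is the contrast the abstract draws: instead of generating distant keyframes and filling in frames with temporal super-resolution, the network sees (a low-resolution version of) every frame of the clip at every scale, which is what makes global temporal consistency easier to maintain.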