

Lumiere: A Space-Time Diffusion Model for Video Generation

January 23, 2024
Authors: Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Yuanzhen Li, Tomer Michaeli, Oliver Wang, Deqing Sun, Tali Dekel, Inbar Mosseri
cs.AI

Abstract

We introduce Lumiere -- a text-to-video diffusion model designed for synthesizing videos that portray realistic, diverse and coherent motion -- a pivotal challenge in video synthesis. To this end, we introduce a Space-Time U-Net architecture that generates the entire temporal duration of the video at once, through a single pass in the model. This is in contrast to existing video models which synthesize distant keyframes followed by temporal super-resolution -- an approach that inherently makes global temporal consistency difficult to achieve. By deploying both spatial and (importantly) temporal down- and up-sampling and leveraging a pre-trained text-to-image diffusion model, our model learns to directly generate a full-frame-rate, low-resolution video by processing it in multiple space-time scales. We demonstrate state-of-the-art text-to-video generation results, and show that our design easily facilitates a wide range of content creation tasks and video editing applications, including image-to-video, video inpainting, and stylized generation.
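To make the single-pass space-time processing concrete, below is a minimal PyTorch sketch (not the authors' implementation) of an encoder-decoder that downsamples and upsamples a video clip in both space and time, so the whole clip's duration is handled in one forward pass. The diffusion process, text conditioning, and the pre-trained text-to-image backbone are omitted, and all names such as `STUNetSketch` and `SpaceTimeBlock` are hypothetical.

```python
# Minimal sketch of the space-time down/up-sampling idea: the clip is compressed
# in BOTH the temporal and spatial axes, processed at the coarsest scale, and
# expanded back, all within a single forward pass. Illustrative only.
import torch
import torch.nn as nn


class SpaceTimeBlock(nn.Module):
    """A 3D conv block that mixes information across time and space."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1)  # (C, T, H, W)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.conv(x))


class STUNetSketch(nn.Module):
    """Encoder-decoder that halves T, H, and W at every level, then restores them."""

    def __init__(self, channels=(16, 32, 64)):
        super().__init__()
        chs = (3,) + tuple(channels)
        self.encoders = nn.ModuleList(
            SpaceTimeBlock(chs[i], chs[i + 1]) for i in range(len(channels))
        )
        self.down = nn.MaxPool3d(kernel_size=2)  # temporal AND spatial downsampling
        self.up = nn.Upsample(scale_factor=2, mode="trilinear", align_corners=False)
        self.decoders = nn.ModuleList(
            SpaceTimeBlock(chs[i + 1] * 2, chs[i]) for i in reversed(range(len(channels)))
        )
        self.out = nn.Conv3d(3, 3, kernel_size=1)

    def forward(self, x):
        # x: (batch, 3, T, H, W) -- the entire clip is processed at once.
        skips = []
        for enc in self.encoders:
            x = enc(x)
            skips.append(x)
            x = self.down(x)  # compress time as well as space
        for dec in self.decoders:
            x = self.up(x)  # expand back toward full frame rate and resolution
            x = dec(torch.cat([x, skips.pop()], dim=1))
        return self.out(x)


# Example: a 16-frame, 64x64 clip passes through the network in one forward call.
video = torch.randn(1, 3, 16, 64, 64)
print(STUNetSketch()(video).shape)  # torch.Size([1, 3, 16, 64, 64])
```

The point of the sketch is the contrast the abstract draws: instead of generating distant keyframes and filling in frames with temporal super-resolution, the network sees (a low-resolution version of) every frame of the clip at every scale, which is what makes global temporal consistency easier to maintain.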