Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation
September 27, 2023
Authors: David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, Rui Zhao, Lingmin Ran, Yuchao Gu, Difei Gao, Mike Zheng Shou
cs.AI
Abstract
Significant advances have been achieved in large-scale pre-trained
text-to-video Diffusion Models (VDMs). However, previous methods either
rely solely on pixel-based VDMs, which come with high computational costs,
or on latent-based VDMs, which often struggle with precise text-video
alignment. In this paper, we are the first to propose a hybrid model,
dubbed Show-1, which marries pixel-based and latent-based VDMs for
text-to-video generation. Our model first uses pixel-based VDMs to produce
a low-resolution video with strong text-video correlation. After that, we
propose a novel expert translation method that employs latent-based VDMs
to further upsample the low-resolution video to high resolution. Compared
to latent VDMs, Show-1 produces high-quality videos with precise
text-video alignment; compared to pixel VDMs, Show-1 is much more
efficient (15 GB vs. 72 GB of GPU memory during inference). We also
validate our model on standard video generation benchmarks. Our code and
model weights are publicly available at https://github.com/showlab/Show-1.
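
To make the two-stage design concrete, below is a minimal Python sketch of the generation flow the abstract describes: a pixel-based VDM handles the alignment-critical low-resolution stage, and a latent-based VDM handles the memory-heavy upsampling stage. The model objects, method names, and resolutions here are illustrative assumptions, not the actual Show-1 interface; the real implementation lives in the linked repository.

```python
# A minimal sketch of the two-stage hybrid pipeline described above.
# `pixel_vdm` and `latent_upsampler` stand in for the paper's pixel-based
# and latent-based diffusion models; their `.generate` / `.upsample`
# methods and the default resolutions are hypothetical placeholders,
# not the actual Show-1 API.

def show1_generate(prompt, pixel_vdm, latent_upsampler,
                   num_frames=16, low_res=(80, 48), high_res=(576, 320)):
    """Two-stage text-to-video generation in the spirit of Show-1."""
    # Stage 1: a pixel-based VDM denoises directly in pixel space at low
    # resolution, where its strong text-video alignment pays off while
    # the computational cost stays manageable.
    low_res_video = pixel_vdm.generate(
        prompt=prompt,
        num_frames=num_frames,
        width=low_res[0],
        height=low_res[1],
    )

    # Stage 2: "expert translation" -- a latent-based VDM, conditioned on
    # the low-resolution video and the prompt, upsamples to high resolution.
    # Operating on compact latents keeps memory low (the abstract cites
    # roughly 15 GB vs. 72 GB of inference GPU memory against pixel VDMs).
    high_res_video = latent_upsampler.upsample(
        video=low_res_video,
        prompt=prompt,
        width=high_res[0],
        height=high_res[1],
    )
    return high_res_video
```

The split mirrors the trade-off the abstract identifies: pixel-space diffusion is kept where text fidelity matters most (low resolution), while latent-space diffusion does the expensive spatial upsampling cheaply.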