

Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation

September 7, 2023
Authors: Jiaxi Gu, Shicong Wang, Haoyu Zhao, Tianyi Lu, Xing Zhang, Zuxuan Wu, Songcen Xu, Wei Zhang, Yu-Gang Jiang, Hang Xu
cs.AI

Abstract

Inspired by the remarkable success of Latent Diffusion Models (LDMs) for image synthesis, we study LDMs for text-to-video generation, a formidable challenge due to the computational and memory constraints during both model training and inference. A single LDM is usually capable of generating only a very limited number of video frames. Some existing works use separate prediction models to generate more video frames, but these suffer from additional training cost and frame-level jittering. In this paper, we propose a framework called "Reuse and Diffuse" (VidRD) to produce more frames following those already generated by an LDM. Conditioned on an initial video clip with a small number of frames, additional frames are iteratively generated by reusing the original latent features and following the previous diffusion process. In addition, for the autoencoder that translates between pixel space and latent space, we inject temporal layers into its decoder and fine-tune these layers for higher temporal consistency. We also propose a set of strategies for composing video-text data with diverse content from multiple existing datasets, including video datasets for action recognition and image-text datasets. Extensive experiments show that our method achieves good results in both quantitative and qualitative evaluations. Our project page is available at https://anonymous0x233.github.io/ReuseAndDiffuse/.
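The iterative extension described in the abstract can be sketched as a simple loop: each round reuses the latent features of the most recent frames, re-noises them, and denoises them into a new clip that is appended to the video. The sketch below is a minimal illustration of this idea only, with a stand-in `denoise` function and hypothetical names; the actual VidRD model is a text-conditioned latent video diffusion network, not shown here.

```python
import numpy as np

def denoise(latents, rng):
    # Stand-in for an LDM denoising pass (hypothetical; the real model
    # is a text-conditioned diffusion network operating on latents).
    return latents + 0.1 * rng.standard_normal(latents.shape)

def reuse_and_diffuse(init_latents, num_iters, frames_per_iter, seed=0):
    """Sketch of iterative frame extension: each round reuses the latent
    features of the last generated frames as the starting point for
    denoising the next clip, following the previous diffusion process."""
    rng = np.random.default_rng(seed)
    clips = [init_latents]
    prev = init_latents
    for _ in range(num_iters):
        # Reuse: seed the new clip from the tail of the previous clip's latents.
        tail = prev[-frames_per_iter:]
        # Re-noise the reused latents, then run the denoising process again.
        noisy = tail + rng.standard_normal(tail.shape)
        new_clip = denoise(noisy, rng)
        clips.append(new_clip)
        prev = new_clip
    # Concatenate all clips along the frame axis to form the full video.
    return np.concatenate(clips, axis=0)

# 8 initial frames of 4x16x16 latents, extended by 3 rounds of 8 frames each.
video = reuse_and_diffuse(np.zeros((8, 4, 16, 16)), num_iters=3, frames_per_iter=8)
print(video.shape)  # (32, 4, 16, 16)
```

The point of the loop is that no separate prediction model is trained: the same denoising process is simply applied again, conditioned on reused latents, which is what avoids the extra training cost mentioned above.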