FIFO-Diffusion: Generating Infinite Videos from Text without Training
May 19, 2024
Authors: Jihwan Kim, Junoh Kang, Jinyoung Choi, Bohyung Han
cs.AI
Abstract
We propose a novel inference technique based on a pretrained diffusion model
for text-conditional video generation. Our approach, called FIFO-Diffusion, is
conceptually capable of generating infinitely long videos without training.
This is achieved by iteratively performing diagonal denoising, which
concurrently processes a series of consecutive frames with increasing noise
levels in a queue; our method dequeues a fully denoised frame at the head while
enqueuing a new random noise frame at the tail. However, diagonal denoising is
a double-edged sword as the frames near the tail can take advantage of cleaner
ones by forward reference, but such a strategy induces a discrepancy between
training and inference. Hence, we introduce latent partitioning to reduce the
training-inference gap and lookahead denoising to leverage the benefit of
forward referencing. We have demonstrated the promising results and
effectiveness of the proposed methods on existing text-to-video generation
baselines.
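The diagonal-denoising queue described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the single-step denoiser below is a hypothetical stand-in for a pretrained diffusion model, and the names (`fifo_diffusion`, `denoise_one_step`, `QUEUE_LEN`) are illustrative. It only demonstrates the queue mechanics: frames at increasing noise levels are all advanced one step per iteration, a fully denoised frame is dequeued at the head, and fresh noise is enqueued at the tail, so generation can in principle continue indefinitely.

```python
from collections import deque
import random

QUEUE_LEN = 4  # number of concurrent frames, one per noise level


def denoise_one_step(latent, level):
    """Hypothetical one-step denoiser standing in for a pretrained
    diffusion model: shrinks the latent toward zero as `level` drops."""
    return [x * (level - 1) / level for x in latent]


def fresh_noise(dim=3):
    """A new pure-noise latent frame."""
    return [random.gauss(0.0, 1.0) for _ in range(dim)]


def fifo_diffusion(num_frames, dim=3):
    # Initialize the queue with frames at increasing noise levels:
    # the head (level 1) is almost clean, the tail (level QUEUE_LEN)
    # is pure noise.
    queue = deque((level, fresh_noise(dim)) for level in range(1, QUEUE_LEN + 1))
    video = []
    while len(video) < num_frames:
        # Diagonal denoising: advance every queued frame by one step,
        # decreasing each frame's noise level by one.
        queue = deque(
            (level - 1, denoise_one_step(latent, level)) for level, latent in queue
        )
        # Dequeue the fully denoised frame at the head...
        level, frame = queue.popleft()
        assert level == 0
        video.append(frame)
        # ...and enqueue a new random-noise frame at the tail, restoring
        # the invariant that the queue holds levels 1..QUEUE_LEN.
        queue.append((QUEUE_LEN, fresh_noise(dim)))
    return video


frames = fifo_diffusion(6)
print(len(frames))  # the loop could run for arbitrarily many frames
```

Because the dequeue/enqueue cycle preserves the queue's ladder of noise levels, the same fixed-size buffer produces one finished frame per iteration, which is what makes the "infinite video" claim possible without retraining the model.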