

Training-free Long Video Generation with Chain of Diffusion Model Experts

August 24, 2024
作者: Wenhao Li, Yichao Cao, Xie Su, Xi Lin, Shan You, Mingkai Zheng, Yi Chen, Chang Xu
cs.AI

Abstract

Video generation models hold substantial potential in areas such as filmmaking. However, current video diffusion models incur high computational costs and produce suboptimal results due to the high complexity of the video generation task. In this paper, we propose ConFiner, an efficient, high-quality video generation framework that decouples video generation into easier subtasks: structure control and spatial-temporal refinement. It generates high-quality videos with a chain of off-the-shelf diffusion model experts, each responsible for one decoupled subtask. During refinement, we introduce coordinated denoising, which merges the capabilities of multiple diffusion experts into a single sampling process. Furthermore, we design the ConFiner-Long framework, which applies three constraint strategies on top of ConFiner to generate long, coherent videos. Experimental results indicate that with only 10% of the inference cost, ConFiner surpasses representative models such as Lavie and ModelScope across all objective and subjective metrics, and ConFiner-Long can generate high-quality, coherent videos of up to 600 frames.
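The two-stage decomposition and coordinated denoising described in the abstract can be sketched as follows. This is a minimal, illustrative reading only: the expert functions, the step schedule, and the simple alternation between experts are all assumptions for exposition, not the paper's actual pretrained models or sampling schedule.

```python
import numpy as np

# Hypothetical stand-ins for off-the-shelf diffusion experts. In the paper
# these would be pretrained diffusion models; here each "denoising step" is
# a toy transformation so the control flow is runnable.
def structure_expert(latent, t):
    # Illustrative structure-control step (coarse denoising).
    return latent * 0.9

def spatial_expert(latent, t):
    # Illustrative spatial refinement step.
    return latent * 0.95

def temporal_expert(latent, t):
    # Illustrative temporal refinement: smooth across frames (axis 0).
    smoothed = (np.roll(latent, 1, axis=0) + latent
                + np.roll(latent, -1, axis=0)) / 3.0
    return smoothed * 0.95

def confiner_sample(shape, steps=8, seed=0):
    """Sketch of ConFiner-style sampling: a structure-control stage, then a
    refinement stage in which spatial and temporal experts take turns
    denoising a shared latent (one possible reading of 'coordinated
    denoising' merging several experts into a single sampling)."""
    rng = np.random.default_rng(seed)
    latent = rng.standard_normal(shape)
    # Stage 1: structure control over the first half of the steps.
    for t in range(steps // 2):
        latent = structure_expert(latent, t)
    # Stage 2: coordinated refinement, interleaving the two experts
    # within the same sampling trajectory.
    for t in range(steps // 2, steps):
        expert = spatial_expert if t % 2 == 0 else temporal_expert
        latent = expert(latent, t)
    return latent

# 16 frames of 8x8 latents (frame count and resolution chosen arbitrarily).
video = confiner_sample((16, 8, 8))
print(video.shape)
```

The key point the sketch tries to capture is that the experts operate on one shared latent within a single sampling run, rather than each generating a full video that is then merged.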
