
Training-free Long Video Generation with Chain of Diffusion Model Experts

August 24, 2024
Authors: Wenhao Li, Yichao Cao, Xie Su, Xi Lin, Shan You, Mingkai Zheng, Yi Chen, Chang Xu
cs.AI

Abstract

Video generation models hold substantial potential in areas such as filmmaking. However, current video diffusion models incur high computational costs and produce suboptimal results due to the high complexity of the video generation task. In this paper, we propose ConFiner, an efficient, high-quality video generation framework that decouples video generation into easier subtasks: structure control and spatial-temporal refinement. It can generate high-quality videos with a chain of off-the-shelf diffusion model experts, with each expert responsible for a decoupled subtask. During refinement, we introduce coordinated denoising, which can merge multiple diffusion experts' capabilities into a single sampling. Furthermore, we design the ConFiner-Long framework, which can generate long, coherent videos by applying three constraint strategies on top of ConFiner. Experimental results indicate that with only 10% of the inference cost, our ConFiner surpasses representative models like Lavie and Modelscope across all objective and subjective metrics. ConFiner-Long can also generate high-quality, coherent videos with up to 600 frames.
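The two-stage pipeline described in the abstract can be illustrated with a toy sketch: one "control expert" first produces a coarse structure, then spatial and temporal refinement experts are alternated inside a single denoising loop (a stand-in for coordinated denoising). All function names, update rules, and step counts below are illustrative assumptions, not the paper's actual models or sampler.

```python
import numpy as np

def control_expert(noise, steps=4):
    # Toy stand-in for the structure-control diffusion expert:
    # coarsely shrink the noise toward a low-magnitude "layout".
    x = noise
    for _ in range(steps):
        x = 0.5 * x  # placeholder denoising update
    return x

def spatial_expert(x):
    # Toy spatial refiner: damp each frame independently.
    return 0.9 * x

def temporal_expert(x):
    # Toy temporal refiner: average neighboring frames along the
    # time axis to encourage frame-to-frame coherence.
    return 0.5 * (x + np.roll(x, shift=1, axis=0))

def confiner(noise, refine_steps=6):
    # Stage 1: structure control by a single expert.
    x = control_expert(noise)
    # Stage 2: "coordinated denoising" -- interleave the spatial and
    # temporal experts so both contribute to one sampling trajectory.
    for t in range(refine_steps):
        x = spatial_expert(x) if t % 2 == 0 else temporal_expert(x)
    return x

frames = np.random.default_rng(0).normal(size=(8, 16, 16))  # (T, H, W)
video = confiner(frames)
print(video.shape)  # (8, 16, 16)
```

The point of the sketch is only the control flow: expensive generation is split into a cheap structure pass followed by refinement passes whose experts share one loop, rather than running several full diffusion samplings end to end.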

