Training-free Long Video Generation with Chain of Diffusion Model Experts

Abstract

Video generation models hold substantial potential in areas such as filmmaking. However, current video diffusion models incur high computational costs and produce suboptimal results because of the high complexity of the video generation task. In this paper, we propose ConFiner, an efficient, high-quality video generation framework that decouples video generation into easier subtasks: structure control and spatial-temporal refinement. It generates high-quality videos with a chain of off-the-shelf diffusion model experts, each responsible for one decoupled subtask. During refinement, we introduce coordinated denoising, which merges the capabilities of multiple diffusion experts into a single sampling process. Furthermore, we design the ConFiner-Long framework, which applies three constraint strategies to ConFiner to generate long, coherent videos. Experimental results indicate that, with only 10% of the inference cost, ConFiner surpasses representative models such as Lavie and Modelscope across all objective and subjective metrics, and that ConFiner-Long can generate high-quality, coherent videos of up to 600 frames.
