FusionFrames: Efficient Architectural Aspects for Text-to-Video Generation Pipeline

Abstract

Multimedia generation approaches occupy a prominent place in artificial intelligence research. Text-to-image models have achieved high-quality results over the last few years, whereas video synthesis methods have only recently begun to develop. This paper presents a new two-stage latent diffusion text-to-video generation architecture based on a text-to-image diffusion model. The first stage synthesizes keyframes that outline the storyline of a video, while the second stage generates interpolation frames that make the movements of the scene and objects smooth. We compare several temporal conditioning approaches for keyframe generation. The results show that separate temporal blocks outperform temporal layers on metrics reflecting video generation quality and on human preference. The design of our interpolation model significantly reduces computational costs compared to other masked frame interpolation approaches. Furthermore, we evaluate different configurations of the MoVQ-based video decoding scheme to improve consistency and achieve better PSNR, SSIM, MSE, and LPIPS scores. Finally, we compare our pipeline with existing solutions and achieve top-2 scores overall and top-1 among open-source solutions: CLIPSIM = 0.2976 and FVD = 433.054. Project page: https://ai-forever.github.io/kandinsky-video/
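As a rough illustration of two of the decoding metrics named above, the sketch below computes MSE and PSNR between a reference frame and a decoded frame with NumPy. It is not the authors' evaluation code: the function names `mse` and `psnr`, the 8-bit pixel range (`max_value=255.0`), and the toy frames are assumptions made for this example. Note that the metrics point in different directions: higher PSNR and SSIM are better, while lower MSE and LPIPS are better.

```python
import numpy as np


def mse(reference, decoded):
    """Mean squared error between two frames of identical shape (lower is better)."""
    ref = np.asarray(reference, dtype=np.float64)
    dec = np.asarray(decoded, dtype=np.float64)
    return float(np.mean((ref - dec) ** 2))


def psnr(reference, decoded, max_value=255.0):
    """Peak signal-to-noise ratio in dB (higher is better).

    max_value is the largest possible pixel value; 255.0 assumes 8-bit frames.
    """
    err = mse(reference, decoded)
    if err == 0.0:
        # Identical frames: PSNR is unbounded.
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / err))


# Toy frames (hypothetical data): a black frame and a uniformly brighter copy.
reference_frame = np.zeros((2, 2, 3), dtype=np.uint8)
decoded_frame = np.full((2, 2, 3), 16, dtype=np.uint8)

print("MSE:", mse(reference_frame, decoded_frame))
print("PSNR (dB):", psnr(reference_frame, decoded_frame))
```

For a video, one simple convention is to average these per-frame values over all frames. SSIM and LPIPS need dedicated implementations and are left out of this sketch.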
