**FusionFrames: Efficient Architectural Aspects for Text-to-Video Generation Pipeline**
November 22, 2023
Authors: Vladimir Arkhipkin, Zein Shaheen, Viacheslav Vasilev, Elizaveta Dakhova, Andrey Kuznetsov, Denis Dimitrov
cs.AI
Abstract
Multimedia generation approaches occupy a prominent place in artificial
intelligence research. Text-to-image models have achieved high-quality results
over the last few years, whereas video synthesis methods have only recently
begun to develop. This paper presents a new two-stage latent diffusion
text-to-video generation architecture based on a text-to-image diffusion model.
The first stage synthesizes keyframes that lay out the storyline of the video,
while the second generates interpolation frames to make the motion of the scene
and objects smooth. We compare several temporal conditioning approaches for
keyframe generation; the results show the advantage of using separate temporal
blocks over temporal layers in terms of metrics reflecting video generation
quality and human preference. The design of our interpolation model
significantly reduces computational costs compared to other masked frame
interpolation approaches. Furthermore, we evaluate different configurations of
the MoVQ-based video decoding scheme to improve consistency and achieve higher
PSNR, SSIM, MSE, and LPIPS scores. Finally, we compare our pipeline with
existing solutions and achieve top-2 scores overall and top-1 among open-source
solutions: CLIPSIM = 0.2976 and FVD = 433.054. Project page:
https://ai-forever.github.io/kandinsky-video/
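
The contrast between "temporal layers" and "separate temporal blocks" is the abstract's central architectural claim. The following is a minimal PyTorch sketch of that distinction under assumed tensor shapes; the module names (`TemporalLayer`, `SeparateTemporalBlock`), the gating scheme, and all hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch: a temporal layer mixes information across frames inside
# a spatial block, while a separate temporal block is a dedicated module applied
# after the (typically frozen) spatial block. Shapes and names are assumptions.
import torch
import torch.nn as nn


class TemporalLayer(nn.Module):
    """Temporal mixing interleaved with spatial processing ('temporal layers' option)."""

    def __init__(self, channels: int):
        super().__init__()
        # 1D convolution over the frame axis, applied per spatial location.
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels, height, width)
        b, t, c, h, w = x.shape
        y = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, c, t)
        y = self.conv(y)
        y = y.reshape(b, h, w, c, t).permute(0, 4, 3, 1, 2)
        return x + y  # residual temporal mixing


class SeparateTemporalBlock(nn.Module):
    """Dedicated temporal block applied after the spatial block
    ('separate temporal blocks' option the abstract reports works better)."""

    def __init__(self, channels: int):
        super().__init__()
        self.norm = nn.GroupNorm(32, channels)
        self.temporal = TemporalLayer(channels)
        # Learnable gate so the block starts as an identity mapping.
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, spatial_out: torch.Tensor) -> torch.Tensor:
        # spatial_out: (batch, frames, channels, height, width)
        b, t, c, h, w = spatial_out.shape
        y = self.norm(spatial_out.reshape(b * t, c, h, w)).reshape(b, t, c, h, w)
        y = self.temporal(y)
        return spatial_out + self.alpha * y


if __name__ == "__main__":
    x = torch.randn(2, 8, 64, 32, 32)  # (batch, frames, channels, height, width)
    block = SeparateTemporalBlock(64)
    print(block(x).shape)  # torch.Size([2, 8, 64, 32, 32])
```

In this reading, the separate-block variant leaves the pretrained text-to-image spatial layers untouched and adds gated temporal modules around them, whereas the temporal-layer variant inserts frame mixing inside the spatial blocks themselves; see the paper for the authors' actual designs and ablations.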