VSTAR: Generative Temporal Nursing for Longer Dynamic Video Synthesis
March 20, 2024
Authors: Yumeng Li, William Beluch, Margret Keuper, Dan Zhang, Anna Khoreva
cs.AI
Abstract
Despite tremendous progress in the field of text-to-video (T2V) synthesis,
open-sourced T2V diffusion models struggle to generate longer videos with
dynamically varying and evolving content. They tend to synthesize quasi-static
videos, ignoring the necessary visual change-over-time implied in the text
prompt. At the same time, scaling these models to enable longer, more dynamic
video synthesis often remains computationally intractable. To address this
challenge, we introduce the concept of Generative Temporal Nursing (GTN), where
we aim to alter the generative process on the fly during inference to improve
control over the temporal dynamics and enable generation of longer videos. We
propose a method for GTN, dubbed VSTAR, which consists of two key ingredients:
1) Video Synopsis Prompting (VSP) - automatic generation of a video synopsis
based on the original single prompt leveraging LLMs, which gives accurate
textual guidance to different visual states of longer videos, and 2) Temporal
Attention Regularization (TAR) - a regularization technique to refine the
temporal attention units of the pre-trained T2V diffusion models, which enables
control over the video dynamics. We experimentally showcase the superiority of
the proposed approach in generating longer, visually appealing videos over
existing open-sourced T2V models. We additionally analyze the temporal
attention maps realized with and without VSTAR, demonstrating the importance of
applying our method to mitigate neglect of the desired visual change over time.
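
The abstract describes the two VSTAR ingredients only at a high level. Below is a minimal, hypothetical sketch of how they could be realized: the synopsis prompt wording for Video Synopsis Prompting and the Gaussian band bias for Temporal Attention Regularization are illustrative assumptions, not the authors' reference implementation.

```python
# Hypothetical sketch of the two VSTAR ingredients described in the abstract.
# Function names, the synopsis prompt wording, and the Gaussian band bias are
# illustrative assumptions, not the paper's reference implementation.
import torch
import torch.nn.functional as F


def build_synopsis_prompt(prompt: str, num_segments: int = 8) -> str:
    """Video Synopsis Prompting (VSP): ask an LLM to expand one prompt into
    per-segment descriptions that spell out the visual change over time."""
    return (
        f"Expand the video prompt '{prompt}' into {num_segments} short scene "
        "descriptions, one per temporal segment, so that the subject visibly "
        "evolves from the first segment to the last."
    )


def temporal_attention_with_tar(q, k, v, sigma=2.0, strength=1.0):
    """Temporal Attention Regularization (TAR): bias the temporal self-attention
    logits of the pre-trained T2V model toward the diagonal, so that temporally
    distant frames attend to each other less and the video is pushed to evolve.

    q, k, v: (batch * spatial, num_frames, dim) temporal-attention inputs.
    """
    num_frames, dim = q.shape[1], q.shape[-1]
    logits = q @ k.transpose(-1, -2) / dim**0.5            # (B, T, T)

    # Gaussian band centered on the diagonal: 1 for a frame attending to
    # itself, decaying toward 0 for temporally distant frames.
    idx = torch.arange(num_frames, device=q.device)
    dist = (idx[None, :] - idx[:, None]).abs().float()     # (T, T) frame distance
    band = torch.exp(-dist.pow(2) / (2 * sigma**2))
    bias = strength * (band - 1.0)                          # 0 on diagonal, negative off it

    attn = F.softmax(logits + bias, dim=-1)
    return attn @ v
```

In this reading, VSP supplies segment-wise text conditioning for the different visual states of a longer video, while TAR is applied at inference time inside the temporal attention units of the frozen model, which matches the training-free "nursing" framing of the abstract.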