VSTAR: Generative Temporal Nursing for Longer Dynamic Video Synthesis
March 20, 2024
Authors: Yumeng Li, William Beluch, Margret Keuper, Dan Zhang, Anna Khoreva
cs.AI
Abstract
Despite tremendous progress in the field of text-to-video (T2V) synthesis,
open-sourced T2V diffusion models struggle to generate longer videos with
dynamically varying and evolving content. They tend to synthesize quasi-static
videos, ignoring the necessary visual change-over-time implied in the text
prompt. At the same time, scaling these models to enable longer, more dynamic
video synthesis often remains computationally intractable. To address this
challenge, we introduce the concept of Generative Temporal Nursing (GTN), where
we aim to alter the generative process on the fly during inference to improve
control over the temporal dynamics and enable generation of longer videos. We
propose a method for GTN, dubbed VSTAR, which consists of two key ingredients:
1) Video Synopsis Prompting (VSP) - automatic generation of a video synopsis
based on the original single prompt leveraging LLMs, which gives accurate
textual guidance to different visual states of longer videos, and 2) Temporal
Attention Regularization (TAR) - a regularization technique to refine the
temporal attention units of the pre-trained T2V diffusion models, which enables
control over the video dynamics. We experimentally showcase the superiority of
the proposed approach in generating longer, visually appealing videos over
existing open-sourced T2V models. We additionally analyze the temporal
attention maps realized with and without VSTAR, demonstrating the importance of
applying our method to mitigate neglect of the desired visual change over time.
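
To make the idea of regularizing temporal attention more concrete, here is a minimal sketch. The abstract does not specify the form of the regularizer, so this assumes one plausible choice: a Gaussian-shaped bias added to the frame-by-frame attention logits before the softmax, so that each frame attends more strongly to temporally nearby frames. The function names and the `sigma` and `strength` parameters are hypothetical and are not taken from the VSTAR code.

```python
import math


def gaussian_band_bias(num_frames, sigma, strength=1.0):
    """Hypothetical bias matrix: entry (i, j) depends only on |i - j|."""
    return [
        [strength * math.exp(-((i - j) ** 2) / (2.0 * sigma ** 2))
         for j in range(num_frames)]
        for i in range(num_frames)
    ]


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def regularized_temporal_attention(logits, sigma, strength=1.0):
    """Add the assumed bias to square attention logits, then normalize each row."""
    n = len(logits)
    bias = gaussian_band_bias(n, sigma, strength)
    return [
        softmax([logits[i][j] + bias[i][j] for j in range(n)])
        for i in range(n)
    ]


# Uniform logits mimic attention that treats every frame alike,
# which is one way a quasi-static video could show up in the attention map.
uniform_logits = [[0.0] * 8 for _ in range(8)]
attn = regularized_temporal_attention(uniform_logits, sigma=1.5, strength=4.0)
```

In this sketch, a smaller `sigma` narrows the band around the diagonal and so weakens the coupling between distant frames, while `strength` sets how strongly the bias overrides the original logits. Both are illustrative knobs under the stated assumption, not settings reported by the authors.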