FancyVideo: Towards Dynamic and Consistent Video Generation via Cross-frame Textual Guidance
August 15, 2024
Authors: Jiasong Feng, Ao Ma, Jing Wang, Bo Cheng, Xiaodan Liang, Dawei Leng, Yuhui Yin
cs.AI
Abstract
Synthesizing motion-rich and temporally consistent videos remains a challenge
in artificial intelligence, especially when dealing with extended durations.
Existing text-to-video (T2V) models commonly employ spatial cross-attention for
text control, guiding the generation of different frames equivalently, without
frame-specific textual guidance. Thus, the model's capacity to comprehend the
temporal logic conveyed in prompts and generate videos with coherent motion is
restricted. To tackle this limitation, we introduce FancyVideo, an innovative
video generator that improves the existing text-control mechanism with the
well-designed Cross-frame Textual Guidance Module (CTGM). Specifically, CTGM
incorporates the Temporal Information Injector (TII), Temporal Affinity Refiner
(TAR), and Temporal Feature Booster (TFB) at the beginning, middle, and end of
cross-attention, respectively, to achieve frame-specific textual guidance.
Firstly, TII injects frame-specific information from latent features into text
conditions, thereby obtaining cross-frame textual conditions. Then, TAR refines
the correlation matrix between cross-frame textual conditions and latent
features along the time dimension. Lastly, TFB boosts the temporal consistency
of latent features. Extensive experiments comprising both quantitative and
qualitative evaluations demonstrate the effectiveness of FancyVideo. Our
approach achieves state-of-the-art T2V generation results on the EvalCrafter
benchmark and facilitates the synthesis of dynamic and consistent videos.
Video demonstration results are available at https://fancyvideo.github.io/,
and we will make our code and model weights publicly available.
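To make the module layout concrete, the following is a minimal PyTorch sketch of the CTGM structure as the abstract describes it: TII injects frame-specific latent information into the text condition, TAR refines the latent-text affinity matrix along the time dimension, and TFB boosts the temporal consistency of the resulting latent features. All tensor shapes, class and parameter names (e.g. CrossFrameTextualGuidance, heads), and the concrete layer choices (multi-head attention for TII, 1-D temporal convolutions for TAR and TFB) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the Cross-frame Textual Guidance Module (CTGM) described
# above. Shapes, layer choices, and names are assumptions for illustration,
# not the authors' implementation.
import torch
import torch.nn as nn


class CrossFrameTextualGuidance(nn.Module):
    """CTGM sketch.

    latent: (B, F, N, C)   per-frame spatial tokens
    text:   (B, M, C)      prompt embeddings shared by all frames
    """

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        # TII: inject frame-specific latent information into the text condition.
        self.tii = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Plain cross-attention projections (latent tokens attend to text).
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        # TAR: refine the latent-text affinity along the time axis
        # (a 1-D temporal convolution stands in for the refinement step).
        self.tar = nn.Conv1d(1, 1, kernel_size=3, padding=1)
        # TFB: boost temporal consistency of the attended latent features.
        self.tfb = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, latent: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        B, Fr, N, C = latent.shape
        lat = latent.reshape(B * Fr, N, C)

        # --- TII: build cross-frame textual conditions ---
        # Each frame gets its own copy of the text, updated by that frame's latent.
        txt = text.unsqueeze(1).expand(B, Fr, -1, -1).reshape(B * Fr, -1, C)
        cross_txt, _ = self.tii(query=txt, key=lat, value=lat)
        cross_txt = cross_txt + txt  # residual

        # --- Cross-attention with TAR applied to the affinity matrix ---
        q, k, v = self.to_q(lat), self.to_k(cross_txt), self.to_v(cross_txt)
        attn = torch.einsum("bnc,bmc->bnm", q, k) * C ** -0.5   # (B*F, N, M)
        M = attn.shape[-1]
        # Treat each (latent token, text token) pair as a 1-D signal over frames.
        a = attn.reshape(B, Fr, N, M).permute(0, 2, 3, 1).reshape(-1, 1, Fr)
        a = self.tar(a).reshape(B, N, M, Fr).permute(0, 3, 1, 2)
        out = torch.einsum("bnm,bmc->bnc",
                           a.reshape(B * Fr, N, M).softmax(dim=-1), v)

        # --- TFB: temporal boost on the output latent features ---
        o = out.reshape(B, Fr, N, C).permute(0, 2, 3, 1).reshape(B * N, C, Fr)
        o = o + self.tfb(o)  # residual temporal mixing
        return o.reshape(B, N, C, Fr).permute(0, 3, 1, 2)  # back to (B, F, N, C)
```

Under these assumptions, a module like this would take the place of the spatial cross-attention block in each layer of a T2V diffusion backbone, so that every frame attends to its own, temporally informed text condition instead of a single shared one.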