
VideoTetris: Towards Compositional Text-to-Video Generation

June 6, 2024
作者: Ye Tian, Ling Yang, Haotian Yang, Yuan Gao, Yufan Deng, Jingmin Chen, Xintao Wang, Zhaochen Yu, Xin Tao, Pengfei Wan, Di Zhang, Bin Cui
cs.AI

Abstract

Diffusion models have demonstrated great success in text-to-video (T2V) generation. However, existing methods may face challenges when handling complex (long) video generation scenarios that involve multiple objects or dynamic changes in object numbers. To address these limitations, we propose VideoTetris, a novel framework that enables compositional T2V generation. Specifically, we propose spatio-temporal compositional diffusion to precisely follow complex textual semantics by manipulating and composing the attention maps of denoising networks spatially and temporally. Moreover, we propose an enhanced video data preprocessing to enhance the training data regarding motion dynamics and prompt understanding, equipped with a new reference frame attention mechanism to improve the consistency of auto-regressive video generation. Extensive experiments demonstrate that our VideoTetris achieves impressive qualitative and quantitative results in compositional T2V generation. Code is available at: https://github.com/YangLing0818/VideoTetris
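The abstract describes composing the attention maps of denoising networks region by region so that each sub-prompt (one per object) controls its own spatial area across frames. The paper does not give the exact formulation here, so the following is only a minimal sketch of the general idea: per-sub-prompt denoiser outputs are blended with spatio-temporal masks, and overlapping masks are averaged. The function name `compose_regions` and the mask convention are assumptions for illustration, not the authors' API.

```python
import numpy as np

def compose_regions(latents, masks, eps=1e-8):
    """Blend per-sub-prompt denoiser outputs with spatial masks.

    latents: list of (H, W, C) arrays, one per sub-prompt branch
             (the same idea extends to (T, H, W, C) for video).
    masks:   list of (H, W) arrays in [0, 1] marking each
             sub-prompt's region; overlaps are averaged.
    """
    out = np.zeros_like(latents[0], dtype=float)
    weight = np.zeros(masks[0].shape + (1,), dtype=float)
    for lat, m in zip(latents, masks):
        m = m[..., None].astype(float)   # broadcast mask over channels
        out += lat * m
        weight += m
    # Normalize where masks overlap; eps guards fully unmasked pixels.
    return out / np.clip(weight, eps, None)

# Toy usage: two branches, left half vs. right half of the frame.
left = np.ones((4, 4, 2))
right = np.full((4, 4, 2), 3.0)
mask_left = np.zeros((4, 4)); mask_left[:, :2] = 1.0
mask_right = 1.0 - mask_left
composed = compose_regions([left, right], [mask_left, mask_right])
```

In a real temporal setting the masks would additionally vary per frame (e.g. to track a moving object or an object count that changes over time), which is the "temporal" half of the composition the abstract refers to.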

