

A Recipe for Scaling up Text-to-Video Generation with Text-free Videos

December 25, 2023
Authors: Xiang Wang, Shiwei Zhang, Hangjie Yuan, Zhiwu Qing, Biao Gong, Yingya Zhang, Yujun Shen, Changxin Gao, Nong Sang
cs.AI

Abstract

Diffusion-based text-to-video generation has witnessed impressive progress in the past year yet still falls behind text-to-image generation. One of the key reasons is the limited scale of publicly available data (e.g., 10M video-text pairs in WebVid10M vs. 5B image-text pairs in LAION), considering the high cost of video captioning. Instead, it could be far easier to collect unlabeled clips from video platforms like YouTube. Motivated by this, we come up with a novel text-to-video generation framework, termed TF-T2V, which can directly learn from text-free videos. The rationale behind it is to separate the process of text decoding from that of temporal modeling. To this end, we employ a content branch and a motion branch, which are jointly optimized with shared weights. Following such a pipeline, we study the effect of doubling the scale of the training set (i.e., video-only WebVid10M) with some randomly collected text-free videos and are encouraged to observe a performance improvement (FID from 9.67 to 8.19 and FVD from 484 to 441), demonstrating the scalability of our approach. We also find that our model enjoys a sustained performance gain (FID from 8.19 to 7.64 and FVD from 441 to 366) after reintroducing some text labels for training. Finally, we validate the effectiveness and generalizability of our idea on both native text-to-video generation and compositional video synthesis paradigms. Code and models will be publicly available at https://tf-t2v.github.io/.
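
The core of the recipe is the separation of text decoding (a content branch trained on captioned data) from temporal modeling (a motion branch trained on text-free clips), with both branches optimized jointly over shared spatial weights. The sketch below illustrates that training scheme only at a schematic level; the module names (`SharedSpatialBackbone`, `MotionBranch`), the toy losses, and the way conditioning is injected are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the TF-T2V training idea (hypothetical names, not the authors' code).
# Assumptions: a denoising-style spatial backbone shared by both branches, a content
# branch trained on captioned frames, and a motion branch trained on text-free clips
# conditioned on a null (empty) text embedding.
import torch
import torch.nn as nn


class SharedSpatialBackbone(nn.Module):
    """Spatial layers shared by both branches (stand-in for a text-conditioned UNet)."""

    def __init__(self, channels: int = 4, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, dim, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(dim, channels, 3, padding=1),
        )

    def forward(self, x, text_emb):
        # Real models inject text via cross-attention; a broadcast add keeps the sketch tiny.
        return self.net(x + text_emb.view(-1, 1, 1, 1))


class MotionBranch(nn.Module):
    """Temporal layers that learn motion from video latents (stand-in for temporal attention)."""

    def __init__(self, channels: int = 4):
        super().__init__()
        self.temporal = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1), padding=(1, 0, 0))

    def forward(self, x):  # x: (B, C, T, H, W)
        return self.temporal(x)


backbone, motion = SharedSpatialBackbone(), MotionBranch()
opt = torch.optim.AdamW(list(backbone.parameters()) + list(motion.parameters()), lr=1e-4)

for _ in range(2):  # toy loop with random tensors standing in for encoded latents
    # Content branch: captioned frames teach text decoding (spatial appearance).
    frames = torch.randn(2, 4, 32, 32)    # noisy latents of captioned images / keyframes
    text_emb = torch.randn(2)             # placeholder per-sample text embedding
    target = torch.randn_like(frames)     # placeholder denoising target
    content_loss = ((backbone(frames, text_emb) - target) ** 2).mean()

    # Motion branch: text-free clips teach temporal modeling, conditioned on a null embedding.
    clip = torch.randn(2, 4, 8, 32, 32)   # noisy latents of an unlabeled clip (B, C, T, H, W)
    null_emb = torch.zeros(2)             # no caption available for these videos
    b, c, t, h, w = clip.shape
    per_frame = clip.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
    spatial = backbone(per_frame, null_emb.repeat_interleave(t))
    spatial = spatial.reshape(b, t, c, h, w).permute(0, 2, 1, 3, 4)
    motion_target = torch.randn_like(clip)
    motion_loss = ((motion(spatial) - motion_target) ** 2).mean()

    # Joint optimization: both losses update the shared spatial weights in one step.
    (content_loss + motion_loss).backward()
    opt.step()
    opt.zero_grad()
```

In the actual framework, the motion branch would correspond to the temporal layers inside a video diffusion UNet, and both branches would be trained with proper diffusion noise schedules and encoders rather than the random targets used here; the sketch only conveys how text-free clips can update the shared weights without any caption.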