A Recipe for Scaling up Text-to-Video Generation with Text-free Videos
December 25, 2023
Authors: Xiang Wang, Shiwei Zhang, Hangjie Yuan, Zhiwu Qing, Biao Gong, Yingya Zhang, Yujun Shen, Changxin Gao, Nong Sang
cs.AI
Abstract
Diffusion-based text-to-video generation has witnessed impressive progress in
the past year yet still falls behind text-to-image generation. One of the key
reasons is the limited scale of publicly available data (e.g., 10M video-text
pairs in WebVid10M vs. 5B image-text pairs in LAION), considering the high cost
of video captioning. Instead, it could be far easier to collect unlabeled clips
from video platforms like YouTube. Motivated by this, we come up with a novel
text-to-video generation framework, termed TF-T2V, which can directly learn
with text-free videos. The rationale behind this is to separate the process of text
decoding from that of temporal modeling. To this end, we employ a content
branch and a motion branch, which are jointly optimized with shared weights.
Following such a pipeline, we study the effect of doubling the scale of the
training set (i.e., video-only WebVid10M) with some randomly collected
text-free videos and are pleased to observe a performance improvement (FID
from 9.67 to 8.19 and FVD from 484 to 441), demonstrating the scalability of
our approach. We also find that our model can achieve a further performance
gain (FID from 8.19 to 7.64 and FVD from 441 to 366) after reintroducing some
text labels for training. Finally, we validate the effectiveness and
generalizability of our idea on both native text-to-video generation and
compositional video synthesis paradigms. Code and models will be publicly
available at https://tf-t2v.github.io/.
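The abstract only sketches the two-branch design at a high level, so below is a minimal, illustrative PyTorch-style sketch of the underlying idea: a text-conditioned content path and a text-free motion path optimized jointly through shared weights. The ToyDenoiser module, tensor shapes, and denoising_loss helper are hypothetical placeholders for this sketch and are not the actual TF-T2V architecture or training code.

```python
# Illustrative sketch only: a content branch (text-conditioned, captioned data)
# and a motion branch (text-free clips) share one denoiser's weights and are
# optimized jointly. ToyDenoiser and denoising_loss are placeholders, not TF-T2V.

import torch
import torch.nn as nn

class ToyDenoiser(nn.Module):
    """Stand-in for a shared spatio-temporal denoising backbone."""
    def __init__(self, channels=4, text_dim=16):
        super().__init__()
        self.spatial = nn.Conv3d(channels, channels, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.temporal = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1), padding=(1, 0, 0))
        self.text_proj = nn.Linear(text_dim, channels)

    def forward(self, x, text_emb=None):
        # x: (batch, channels, frames, height, width) noisy video latents
        h = self.spatial(x)
        if text_emb is not None:            # content branch: inject text conditioning
            h = h + self.text_proj(text_emb)[:, :, None, None, None]
        return self.temporal(h)             # motion branch exercises the temporal layers

def denoising_loss(pred, target):
    return nn.functional.mse_loss(pred, target)

model = ToyDenoiser()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

# One joint step: a captioned batch updates the text-conditioned (content) path,
# a text-free batch updates the temporal (motion) path; both share the same weights.
captioned_latents, noise_a = torch.randn(2, 4, 8, 16, 16), torch.randn(2, 4, 8, 16, 16)
text_emb = torch.randn(2, 16)
textfree_latents, noise_b = torch.randn(2, 4, 8, 16, 16), torch.randn(2, 4, 8, 16, 16)

loss = denoising_loss(model(captioned_latents, text_emb), noise_a) \
     + denoising_loss(model(textfree_latents, None), noise_b)
opt.zero_grad()
loss.backward()
opt.step()
```

The point the sketch tries to capture is that text-free clips still supply a useful training signal to the shared temporal layers, which is what allows the training set to be scaled up with unlabeled videos as the abstract describes.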