A Recipe for Scaling up Text-to-Video Generation with Text-free Videos

Abstract

Diffusion-based text-to-video generation has made impressive progress over the past year, yet it still lags behind text-to-image generation. A key reason is the limited scale of publicly available data (e.g., 10M video-text pairs in WebVid10M vs. 5B image-text pairs in LAION), given the high cost of video captioning. By contrast, collecting unlabeled clips from video platforms such as YouTube can be far easier. Motivated by this, we propose TF-T2V, a novel text-to-video generation framework that can learn directly from text-free videos. The underlying rationale is to separate the process of text decoding from that of temporal modeling. To this end, we employ a content branch and a motion branch, which are jointly optimized with shared weights. Following this pipeline, we study the effect of doubling the scale of the training set (i.e., video-only WebVid10M) with randomly collected text-free videos and observe improved performance (FID from 9.67 to 8.19 and FVD from 484 to 441), demonstrating the scalability of our approach. We also find that the model improves further (FID from 8.19 to 7.64 and FVD from 441 to 366) after some text labels are reintroduced for training. Finally, we validate the effectiveness and generalizability of our idea on both native text-to-video generation and compositional video synthesis paradigms. Code and models will be publicly available at https://tf-t2v.github.io/.
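
As a reading aid, the scores quoted above can be converted into relative reductions. The snippet below is only an illustrative calculation on the numbers stated in the abstract; the helper name `relative_reduction` and the dictionary labels are hypothetical and do not come from the paper or its code release.

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the percentage decrease from `before` to `after`."""
    return (before - after) / before * 100.0


# Scores quoted in the abstract (lower is better for both FID and FVD).
# The labels are illustrative names, not identifiers from the paper.
reported = {
    "FID, adding text-free videos": (9.67, 8.19),
    "FVD, adding text-free videos": (484.0, 441.0),
    "FID, reintroducing text labels": (8.19, 7.64),
    "FVD, reintroducing text labels": (441.0, 366.0),
}

reductions = {
    name: relative_reduction(before, after)
    for name, (before, after) in reported.items()
}

for name, pct in reductions.items():
    print(f"{name}: {pct:.1f}% lower")
```

Under this reading, adding text-free videos lowers FID by about 15.3% and FVD by about 8.9%, and reintroducing some text labels lowers them by a further 6.7% and 17.0% respectively.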
