VideoFactory: Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation

Abstract

We present VideoFactory, a framework for generating high-quality open-domain videos. VideoFactory produces high-definition (1376x768), widescreen (16:9) videos without watermarks, creating an engaging user experience. Generating videos from text instructions is challenging for two reasons: the model must capture the complex relationship between space and time, and large-scale paired text-video data is scarce. Previous approaches extend pretrained text-to-image generation models by adding temporal 1D convolution/attention modules for video generation. However, these approaches overlook the importance of jointly modeling space and time, which inevitably leads to temporal distortions and misalignment between texts and videos. In this paper, we propose a novel approach that strengthens the interaction between spatial and temporal perceptions. In particular, we use a swapped cross-attention mechanism in 3D windows that alternates the "query" role between spatial and temporal blocks, so that each reinforces the other. To fully unlock the model's capabilities for high-quality video generation, we curate a large-scale video dataset called HD-VG-130M. This dataset comprises 130 million text-video pairs from the open domain, with high-definition, widescreen, and watermark-free characteristics. Objective metrics and user studies demonstrate that our approach is superior in per-frame quality, temporal correlation, and text-video alignment by clear margins.
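The idea of alternating the "query" role can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not specify the projections, the residual connections, or how 3D windows are partitioned. Here the function names (`cross_attention`, `swapped_cross_attention`), the identity key/value projections, the residual update, and the even/odd alternation schedule are all assumptions made for illustration, and the two token lists are assumed to come from a single 3D window.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Scaled dot-product attention: each query token attends over the context.

    Assumption: keys and values are the context tokens themselves
    (no learned projections), to keep the sketch small.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, context)) for d in range(dim)]
        )
    return outputs


def swapped_cross_attention(spatial, temporal, num_layers=2):
    """Alternate which feature set plays the query role (hypothetical schedule).

    Even layers: spatial tokens query the temporal tokens.
    Odd layers: temporal tokens query the spatial tokens.
    Each update is added back as a residual (an assumption).
    """
    for layer in range(num_layers):
        if layer % 2 == 0:
            update = cross_attention(spatial, temporal)
            spatial = [[s + u for s, u in zip(row, upd)]
                       for row, upd in zip(spatial, update)]
        else:
            update = cross_attention(temporal, spatial)
            temporal = [[t + u for t, u in zip(row, upd)]
                        for row, upd in zip(temporal, update)]
    return spatial, temporal


# Toy tokens from one hypothetical 3D window: 2 spatial and 3 temporal, dimension 2.
spatial_tokens = [[1.0, 0.0], [0.0, 1.0]]
temporal_tokens = [[0.5, 0.5], [1.0, 1.0], [0.0, 2.0]]
new_spatial, new_temporal = swapped_cross_attention(spatial_tokens, temporal_tokens)
print(len(new_spatial), len(new_temporal))  # token counts are preserved: 2 3
```

Because the query role swaps from layer to layer, both feature sets are updated with information from the other, which is one way to read the "mutual reinforcement" described in the abstract.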
