Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis

Abstract

Contemporary models for generating images show remarkable quality and versatility. Swayed by these advantages, the research community repurposes them to generate videos. Since video content is highly redundant, we argue that naively bringing advances of image models to the video generation domain reduces motion fidelity and visual quality, and impairs scalability. In this work, we build Snap Video, a video-first model that systematically addresses these challenges. To do that, we first extend the EDM framework to take into account spatially and temporally redundant pixels and naturally support video generation. Second, we show that a U-Net, a workhorse behind image generation, scales poorly when generating videos, requiring significant computational overhead. Hence, we propose a new transformer-based architecture that trains 3.31 times faster than U-Nets (and is ~4.5 times faster at inference). This allows us to efficiently train a text-to-video model with billions of parameters for the first time, reach state-of-the-art results on a number of benchmarks, and generate videos with substantially higher quality, temporal consistency, and motion complexity. The user studies showed that our model was favored by a large margin over the most recent methods. See our website at https://snap-research.github.io/snapvideo/.
