Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation

Abstract

Significant advancements have been achieved in the realm of large-scale pre-trained text-to-video Diffusion Models (VDMs). However, previous methods rely solely either on pixel-based VDMs, which come with high computational costs, or on latent-based VDMs, which often struggle with precise text-video alignment. In this paper, we are the first to propose a hybrid model, dubbed Show-1, which marries pixel-based and latent-based VDMs for text-to-video generation. Our model first uses pixel-based VDMs to produce a low-resolution video with strong text-video correlation. After that, we propose a novel expert translation method that employs latent-based VDMs to further upsample the low-resolution video to high resolution. Compared to latent VDMs, Show-1 can produce high-quality videos with precise text-video alignment; compared to pixel VDMs, Show-1 is much more efficient (GPU memory usage during inference is 15G vs. 72G). We also validate our model on standard video generation benchmarks. Our code and model weights are publicly available at https://github.com/showlab/Show-1.
