Lumiere: A Space-Time Diffusion Model for Video Generation

Abstract

We introduce Lumiere, a text-to-video diffusion model designed for synthesizing videos that portray realistic, diverse and coherent motion, a pivotal challenge in video synthesis. To this end, we introduce a Space-Time U-Net architecture that generates the entire temporal duration of the video at once, through a single pass in the model. This is in contrast to existing video models, which synthesize distant keyframes followed by temporal super-resolution, an approach that inherently makes global temporal consistency difficult to achieve. By deploying both spatial and (importantly) temporal down- and up-sampling and leveraging a pre-trained text-to-image diffusion model, our model learns to directly generate a full-frame-rate, low-resolution video by processing it in multiple space-time scales. We demonstrate state-of-the-art text-to-video generation results, and show that our design easily facilitates a wide range of content creation tasks and video editing applications, including image-to-video, video inpainting, and stylized generation.
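
To make the idea of processing a video "in multiple space-time scales" concrete, the following is a minimal sketch, not the authors' implementation. It only tracks how the (frames, height, width) grid would shrink and grow if each U-Net level halved both time and space. The function names, the factor of 2, the number of levels, and the example clip size are all illustrative assumptions and are not taken from the paper.

```python
from typing import List, Tuple

Shape = Tuple[int, int, int]  # (frames, height, width)


def space_time_pyramid(shape: Shape, levels: int, factor: int = 2) -> List[Shape]:
    """Return the (frames, height, width) grid at each encoder level.

    Every level shrinks time AND space by `factor`. Both `factor` and
    `levels` are illustrative assumptions, not values from the paper.
    """
    frames, height, width = shape
    pyramid = [shape]
    for _ in range(levels):
        if frames % factor or height % factor or width % factor:
            raise ValueError(
                f"{(frames, height, width)} is not divisible by {factor}"
            )
        frames, height, width = (
            frames // factor,
            height // factor,
            width // factor,
        )
        pyramid.append((frames, height, width))
    return pyramid


def unet_shapes(shape: Shape, levels: int, factor: int = 2) -> List[Shape]:
    """Encoder shapes followed by the mirrored decoder shapes."""
    down = space_time_pyramid(shape, levels, factor)
    up = down[-2::-1]  # decoder retraces the encoder back to full size
    return down + up


if __name__ == "__main__":
    # Hypothetical 16-frame, 64x64 clip passed through two levels.
    for frames, height, width in unet_shapes((16, 64, 64), levels=2):
        print(f"{frames:>3} frames at {height}x{width}")
```

In this sketch the output grid has the same number of frames as the input, which mirrors the abstract's point that the whole clip is produced in one pass at full frame rate, with only the intermediate levels working on a temporally and spatially coarser grid.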
