Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models

Abstract

This paper presents Video-ControlNet, a controllable text-to-video (T2V) diffusion model that generates videos conditioned on a sequence of control signals, such as edge or depth maps. Video-ControlNet builds on a pre-trained conditional text-to-image (T2I) diffusion model and adds a spatial-temporal self-attention mechanism and trainable temporal layers for efficient cross-frame modeling. A first-frame conditioning strategy lets the model generate videos transferred from the image domain, as well as videos of arbitrary length in an auto-regressive manner. In addition, Video-ControlNet employs a novel residual-based noise initialization strategy that introduces a motion prior from an input video, producing more coherent videos. With the proposed architecture and strategies, Video-ControlNet achieves resource-efficient convergence and generates high-quality, consistent videos with fine-grained control. Extensive experiments on video generation tasks such as video editing and video style transfer show that it outperforms previous methods in both consistency and quality. Project Page: https://controlavideo.github.io/
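The abstract names a residual-based noise initialization but does not describe its mechanics. The sketch below is one plausible reading, not the authors' implementation: pixels whose value barely changes between consecutive frames reuse the previous frame's noise, while pixels that change receive fresh Gaussian noise. The function name `residual_noise_init`, the `threshold` parameter, and the flattened grayscale frame format are all assumptions made for illustration.

```python
import random


def residual_noise_init(frames, threshold=0.1, seed=0):
    """Hypothetical sketch of residual-based noise initialization.

    frames: list of frames, each a flat list of pixel values in [0, 1].
    Returns one noise list per frame, with the same length as the frame.
    """
    rng = random.Random(seed)
    # The first frame has no predecessor, so all of its noise is fresh.
    noise = [[rng.gauss(0.0, 1.0) for _ in frames[0]]]
    for prev, cur in zip(frames, frames[1:]):
        last = noise[-1]
        noise.append([
            # Resample where the frame residual is large (motion);
            # otherwise carry the previous frame's noise forward.
            rng.gauss(0.0, 1.0) if abs(c - p) > threshold else n
            for p, c, n in zip(prev, cur, last)
        ])
    return noise


# Usage: only the first pixel changes between frame 0 and frame 1.
frames = [[0.0, 0.0, 0.0, 0.0],
          [1.0, 0.0, 0.0, 0.0],
          [1.0, 0.0, 0.0, 0.0]]
noise = residual_noise_init(frames)
```

Because the resampling mask depends only on the frames and not on the noise values, each pixel's noise stays standard normal while static regions share noise across time, which is the kind of motion prior the abstract alludes to.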
