AnimateZero: Video Diffusion Models are Zero-Shot Image Animators

Abstract

Large-scale text-to-video (T2V) diffusion models have made great progress in recent years in terms of visual quality, motion, and temporal consistency. However, the generation process is still a black box, where all attributes (e.g., appearance, motion) are learned and generated jointly, with no means of precise control beyond rough text descriptions. Inspired by image animation, which decouples a video into one specific appearance and its corresponding motion, we propose AnimateZero to unveil the pre-trained text-to-video diffusion model AnimateDiff and give it more precise appearance and motion control. For appearance control, we borrow intermediate latents and their features from text-to-image (T2I) generation to ensure that the generated first frame is identical to the given generated image. For temporal control, we replace the global temporal attention of the original T2V model with our proposed positional-corrected window attention so that the other frames align well with the first frame. Empowered by the proposed methods, AnimateZero can successfully control the generation process without further training. As a zero-shot image animator for given images, AnimateZero also enables multiple new applications, including interactive video generation and real image animation. Detailed experiments demonstrate the effectiveness of the proposed method in both T2V and related applications.
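To make the idea of windowed temporal attention concrete, the following is a minimal sketch and not the paper's positional-corrected window attention, whose exact window layout and positional correction the abstract does not specify. It assumes a hypothetical rule in which each frame attends only to frames from the first frame up to itself, so the first frame is a key in every window. The function names `first_frame_window_mask` and `masked_attention_weights` are illustrative and do not come from AnimateZero or AnimateDiff.

```python
import math


def first_frame_window_mask(num_frames):
    """mask[i][j] is True when query frame i may attend to key frame j.

    Assumed rule (hypothetical, not taken from the paper): frame i sees
    frames 0..i, so the first frame is inside every window and frame 0
    sees only itself.
    """
    return [[j <= i for j in range(num_frames)] for i in range(num_frames)]


def masked_attention_weights(scores, mask):
    """Row-wise softmax over scores; masked-out keys get zero weight."""
    weights = []
    for row_scores, row_mask in zip(scores, mask):
        allowed = [s for s, m in zip(row_scores, row_mask) if m]
        peak = max(allowed)  # subtract the max for numerical stability
        exps = [
            math.exp(s - peak) if m else 0.0
            for s, m in zip(row_scores, row_mask)
        ]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


if __name__ == "__main__":
    mask = first_frame_window_mask(4)
    uniform_scores = [[0.0] * 4 for _ in range(4)]
    for row in masked_attention_weights(uniform_scores, mask):
        print([round(w, 2) for w in row])
```

Under this assumed rule the first frame attends only to itself, so temporal attention cannot pull content from later frames into it, while every later frame always has the first frame available as a key. This is only an illustration of the alignment goal stated in the abstract.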
