

AnimateZero: Video Diffusion Models are Zero-Shot Image Animators

December 6, 2023
Authors: Jiwen Yu, Xiaodong Cun, Chenyang Qi, Yong Zhang, Xintao Wang, Ying Shan, Jian Zhang
cs.AI

Abstract

Large-scale text-to-video (T2V) diffusion models have made great progress in recent years in terms of visual quality, motion, and temporal consistency. However, the generation process is still a black box, where all attributes (e.g., appearance, motion) are learned and generated jointly, with no precise control beyond rough text descriptions. Inspired by image animation, which decouples a video into a specific appearance with corresponding motion, we propose AnimateZero to unveil the pre-trained text-to-video diffusion model, i.e., AnimateDiff, and provide more precise appearance and motion control for it. For appearance control, we borrow intermediate latents and their features from text-to-image (T2I) generation to ensure that the generated first frame matches the given generated image. For temporal control, we replace the global temporal attention of the original T2V model with our proposed positional-corrected window attention to ensure that the other frames align well with the first frame. Empowered by these methods, AnimateZero can successfully control the generation process without further training. As a zero-shot image animator for given images, AnimateZero also enables multiple new applications, including interactive video generation and real image animation. Detailed experiments demonstrate the effectiveness of the proposed method in both T2V generation and the related applications.
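The temporal-control idea described above, replacing global temporal attention (every frame attends to every frame) with a window attention in which each frame attends to the first frame plus a local window of preceding frames, can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the function names, the causal-window scheme, and the use of NumPy are assumptions made for clarity.

```python
import numpy as np

def window_attention_mask(num_frames: int, window: int) -> np.ndarray:
    """Boolean mask where frame i may attend to frame 0 (the anchor
    appearance frame) and to a local window of preceding frames,
    instead of to all frames as in global temporal attention.
    Hypothetical sketch of the windowing idea."""
    mask = np.zeros((num_frames, num_frames), dtype=bool)
    for i in range(num_frames):
        mask[i, 0] = True                 # every frame attends to frame 0
        lo = max(0, i - window + 1)
        mask[i, lo:i + 1] = True          # local causal window (incl. self)
    return mask

def masked_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray,
                     mask: np.ndarray) -> np.ndarray:
    """Scaled dot-product attention over frames with a boolean mask;
    masked-out positions receive zero weight."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)  # block disallowed frames
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Because frame 0 stays in every frame's attention window, its appearance features (borrowed from the T2I latents) can propagate to all later frames, which is the alignment property the abstract describes.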