Fine-grained Zero-shot Video Sampling

July 31, 2024
Authors: Dengsheng Chen, Jie Hu, Xiaoming Wei, Enhua Wu
cs.AI

Abstract

Incorporating a temporal dimension into pretrained image diffusion models for video generation is a prevalent approach. However, this method is computationally demanding and necessitates large-scale video datasets. More critically, the heterogeneity between image and video datasets often results in catastrophic forgetting of the image expertise. Recent attempts to directly extract video snippets from image diffusion models have somewhat mitigated these problems. Nevertheless, these methods can only generate brief video clips with simple movements and fail to capture fine-grained motion or non-grid deformation. In this paper, we propose a novel Zero-Shot video Sampling algorithm, denoted as ZS^2, capable of directly sampling high-quality video clips from existing image synthesis methods, such as Stable Diffusion, without any training or optimization. Specifically, ZS^2 utilizes the dependency noise model and temporal momentum attention to ensure content consistency and animation coherence, respectively. This ability enables it to excel in related tasks, such as conditional and context-specialized video generation and instruction-guided video editing. Experimental results demonstrate that ZS^2 achieves state-of-the-art performance in zero-shot video generation, occasionally outperforming recent supervised methods. Homepage: https://densechen.github.io/zss/.
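
The abstract only names ZS^2's two mechanisms; the paper and homepage give the actual formulations. Purely as an illustration of the idea, the PyTorch sketch below shows one plausible reading: correlated per-frame noise for content consistency, and an exponential moving average over attention keys/values for animation coherence. Every function name, the EMA formulation, and the parameters alpha and beta are assumptions made for this sketch, not the authors' implementation.

```python
import torch

def dependent_noise(n_frames, shape, alpha=0.9, generator=None):
    """Hypothetical 'dependency noise' sampler: frame t's latent noise
    mixes the previous frame's noise with fresh noise, so frames stay
    correlated (content consistency) while still varying over time."""
    eps = torch.randn(shape, generator=generator)
    frames = [eps]
    for _ in range(n_frames - 1):
        fresh = torch.randn(shape, generator=generator)
        # alpha sets frame-to-frame correlation; the mixing weights
        # keep the marginal variance of each frame at ~1.
        eps = alpha * eps + (1 - alpha ** 2) ** 0.5 * fresh
        frames.append(eps)
    return torch.stack(frames)  # (n_frames, *shape)

def momentum_attention(qkv_per_frame, beta=0.5):
    """Guessed reading of 'temporal momentum attention': each frame's
    self-attention uses keys/values blended with an exponential moving
    average over earlier frames, smoothing motion across time."""
    outputs, running = [], None
    for q, k, v in qkv_per_frame:  # tensors of shape (seq, dim)
        if running is None:
            running = (k, v)
        else:
            running = tuple(beta * r + (1 - beta) * x
                            for r, x in zip(running, (k, v)))
        k_m, v_m = running
        scores = q @ k_m.transpose(-2, -1) / k_m.shape[-1] ** 0.5
        outputs.append(torch.softmax(scores, dim=-1) @ v_m)
    return outputs
```

The intuition matching the abstract: noise shared across frames keeps the scene stable, while the momentum term lets each frame attend partly to its predecessors, smoothing motion without any training or fine-tuning of the underlying image model.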
