
Fine-grained Zero-shot Video Sampling

July 31, 2024
Authors: Dengsheng Chen, Jie Hu, Xiaoming Wei, Enhua Wu
cs.AI

Abstract

Incorporating a temporal dimension into pretrained image diffusion models for video generation is a prevalent approach. However, this method is computationally demanding and necessitates large-scale video datasets. More critically, the heterogeneity between image and video datasets often results in catastrophic forgetting of the image expertise. Recent attempts to directly extract video snippets from image diffusion models have somewhat mitigated these problems. Nevertheless, these methods can only generate brief video clips with simple movements and fail to capture fine-grained motion or non-grid deformation. In this paper, we propose a novel Zero-Shot video Sampling algorithm, denoted as ZS^2, capable of directly sampling high-quality video clips from existing image synthesis methods, such as Stable Diffusion, without any training or optimization. Specifically, ZS^2 utilizes a dependency noise model and temporal momentum attention to ensure content consistency and animation coherence, respectively. These properties enable it to excel in related tasks, such as conditional and context-specialized video generation and instruction-guided video editing. Experimental results demonstrate that ZS^2 achieves state-of-the-art performance in zero-shot video generation, occasionally outperforming recent supervised methods. Homepage: https://densechen.github.io/zss/.
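
The abstract names two ingredients: a dependency noise model for content consistency across frames and temporal momentum attention for animation coherence. The paper's exact formulations are not reproduced here; the sketch below is a minimal, hypothetical PyTorch illustration of how such mechanisms could look, assuming the noise model mixes a shared base latent with per-frame residuals and the attention variant keeps an exponential moving average of features across frames. The names `dependent_noise`, `TemporalMomentumAttention`, `alpha`, and `momentum` are illustrative, not from the paper.

```python
# Hypothetical sketch (not the authors' released code) of the two mechanisms
# described in the abstract, on top of a frozen image diffusion model.
import torch


def dependent_noise(num_frames: int, shape: tuple, alpha: float = 0.9,
                    generator: torch.Generator | None = None) -> torch.Tensor:
    """Sample per-frame initial latents whose noise is correlated over time.

    Each frame mixes a shared base component with an independent residual,
    so neighboring frames start denoising from similar latents (content
    consistency) while keeping per-frame variation. `alpha` is the share of
    the common component; alpha=0 recovers i.i.d. noise. The mixture is
    variance-preserving: alpha + (1 - alpha) = 1.
    """
    base = torch.randn(shape, generator=generator)          # shared across frames
    frames = []
    for _ in range(num_frames):
        residual = torch.randn(shape, generator=generator)  # frame-specific part
        frames.append(alpha ** 0.5 * base + (1 - alpha) ** 0.5 * residual)
    return torch.stack(frames)  # (num_frames, *shape)


class TemporalMomentumAttention(torch.nn.Module):
    """Wrap a self-attention layer so its keys/values carry an exponential
    moving average over previously generated frames (animation coherence)."""

    def __init__(self, attn: torch.nn.MultiheadAttention, momentum: float = 0.8):
        super().__init__()
        self.attn = attn
        self.momentum = momentum
        self.kv_ema = None  # running summary of past frames' features

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Blend the current frame's features into the running average, then
        # attend with the averaged features as keys/values.
        if self.kv_ema is None:
            self.kv_ema = x.detach()
        else:
            self.kv_ema = (self.momentum * self.kv_ema
                           + (1 - self.momentum) * x.detach())
        out, _ = self.attn(query=x, key=self.kv_ema, value=self.kv_ema)
        return out
```

Under these assumptions, frame i's denoising would start from `dependent_noise(...)[i]` and run through a network whose self-attention layers are wrapped by `TemporalMomentumAttention`, so frames share both a common starting latent and a slowly drifting attention context, with no training or fine-tuning of the underlying image model.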
