Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation
June 13, 2023
Authors: Shuai Yang, Yifan Zhou, Ziwei Liu, Chen Change Loy
cs.AI
Abstract
Large text-to-image diffusion models have exhibited impressive proficiency in
generating high-quality images. However, when applying these models to the video
domain, ensuring temporal consistency across video frames remains a formidable
challenge. This paper proposes a novel zero-shot text-guided video-to-video
translation framework to adapt image models to videos. The framework includes
two parts: key frame translation and full video translation. The first part
uses an adapted diffusion model to generate key frames, with hierarchical
cross-frame constraints applied to enforce coherence in shapes, textures and
colors. The second part propagates the key frames to other frames with
temporal-aware patch matching and frame blending. Our framework achieves global
style and local texture temporal consistency at a low cost (without re-training
or optimization). The adaptation is compatible with existing image diffusion
techniques, allowing our framework to take advantage of them, such as
customizing a specific subject with LoRA, and introducing extra spatial
guidance with ControlNet. Extensive experimental results demonstrate the
effectiveness of our proposed framework over existing methods in rendering
high-quality and temporally coherent videos.
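
To make the key-frame stage more concrete, the sketch below shows a minimal per-key-frame translation loop built on the Hugging Face diffusers library with ControlNet edge guidance, one of the image-diffusion techniques the abstract says the framework is compatible with. This is an illustrative sketch only, not the authors' implementation: the model IDs, key-frame stride, prompt, and helper names are assumptions, and it omits the paper's hierarchical cross-frame constraints as well as the temporal-aware patch-matching and frame-blending propagation stage.

```python
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetImg2ImgPipeline

# Load an edge-conditioned ControlNet and plug it into a Stable Diffusion
# img2img pipeline (model IDs are assumptions, chosen for illustration).
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")


def canny_map(frame: Image.Image) -> Image.Image:
    """Canny edge map used as ControlNet spatial guidance."""
    edges = cv2.Canny(np.array(frame.convert("L")), 100, 200)
    return Image.fromarray(np.stack([edges] * 3, axis=-1))


def translate_key_frames(frames, prompt, stride=10, strength=0.75):
    """Translate every `stride`-th frame according to the text prompt.

    The in-between frames would be filled in by the propagation stage
    (temporal-aware patch matching and frame blending), which is not shown,
    and this simplified loop applies no cross-frame constraints.
    """
    key_frames = {}
    for idx in range(0, len(frames), stride):
        frame = frames[idx]
        key_frames[idx] = pipe(
            prompt=prompt,
            image=frame,                     # source frame for img2img
            control_image=canny_map(frame),  # edges preserve the scene layout
            strength=strength,               # how strongly the frame is repainted
            num_inference_steps=20,
        ).images[0]
    return key_frames
```

A call such as `translate_key_frames(frames, "a watercolor painting of the scene")` would yield stylized key frames; in the full method those key frames are then propagated to the remaining frames, which is where the temporal consistency of local textures comes from.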