Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation

Abstract

Large text-to-image diffusion models have exhibited impressive proficiency in generating high-quality images. However, when applying these models to the video domain, ensuring temporal consistency across video frames remains a formidable challenge. This paper proposes a novel zero-shot text-guided video-to-video translation framework to adapt image models to videos. The framework includes two parts: key frame translation and full video translation. The first part uses an adapted diffusion model to generate key frames, with hierarchical cross-frame constraints applied to enforce coherence in shapes, textures and colors. The second part propagates the key frames to other frames with temporal-aware patch matching and frame blending. Our framework achieves temporal consistency in both global style and local texture at a low cost, without re-training or optimization. The adaptation is compatible with existing image diffusion techniques, allowing our framework to take advantage of them, for example by customizing a specific subject with LoRA or introducing extra spatial guidance with ControlNet. Extensive experimental results demonstrate the effectiveness of our proposed framework over existing methods in rendering high-quality and temporally coherent videos.
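The two-part structure described in the abstract (translate a sparse set of key frames, then propagate them to the remaining frames) can be illustrated with a deliberately simplified sketch. This is not the paper's method: the abstract states that propagation uses temporal-aware patch matching and frame blending, whereas the code below swaps in a plain linear cross-fade between the two nearest key frames. The function names `select_key_frames` and `propagate`, the uniform sampling interval, and the toy constant-valued frames are all assumptions made for illustration. The diffusion-based key frame translation step is not modeled at all and is represented only by the already-translated `key_frames` array.

```python
import numpy as np


def select_key_frames(num_frames, interval):
    """Return uniformly spaced key frame indices (hypothetical helper).

    The last frame is always included so every frame lies between two keys.
    """
    indices = list(range(0, num_frames, interval))
    if indices[-1] != num_frames - 1:
        indices.append(num_frames - 1)
    return indices


def propagate(key_indices, key_frames, num_frames):
    """Fill in all frames from translated key frames (simplified stand-in).

    Assumption: a linear cross-fade replaces the patch matching and
    frame blending named in the abstract.
    """
    frames = []
    for t in range(num_frames):
        # Position of the first key index that is >= t.
        j = int(np.searchsorted(key_indices, t))
        if key_indices[j] == t:
            # Key frames are copied through unchanged.
            frames.append(key_frames[j].astype(np.float32))
            continue
        lo, hi = key_indices[j - 1], key_indices[j]
        w = (t - lo) / (hi - lo)  # 0 at the earlier key, 1 at the later key
        blended = (1.0 - w) * key_frames[j - 1] + w * key_frames[j]
        frames.append(blended.astype(np.float32))
    return np.stack(frames)


# Toy usage: 7 frames of 2x2 pixels, key frames every 3 frames.
num_frames = 7
key_indices = select_key_frames(num_frames, interval=3)
# Stand-in for translated key frames: constant images with value 10 * index.
key_frames = np.stack(
    [np.full((2, 2), 10.0 * i, dtype=np.float32) for i in key_indices]
)
video = propagate(key_indices, key_frames, num_frames)
```

The sketch only shows why key frame consistency matters: every non-key frame is derived from its neighboring key frames, so any mismatch between key frames carries over to the frames in between.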
