I2VEdit: First-Frame-Guided Video Editing via Image-to-Video Diffusion Models

Abstract

The remarkable generative capabilities of diffusion models have motivated extensive research in both image and video editing. Compared to video editing, which faces additional challenges in the time dimension, image editing has seen the development of more diverse, high-quality approaches and more capable software such as Photoshop. In light of this gap, we introduce a novel and generic solution that extends the applicability of image editing tools to videos by propagating edits from a single frame to the entire video using a pre-trained image-to-video model. Our method, dubbed I2VEdit, adaptively preserves the visual and motion integrity of the source video depending on the extent of the edits, effectively handling global edits, local edits, and moderate shape changes, which existing methods cannot fully achieve. At the core of our method are two main processes: Coarse Motion Extraction, which aligns basic motion patterns with the original video, and Appearance Refinement, which makes precise adjustments using fine-grained attention matching. We also incorporate a skip-interval strategy to mitigate quality degradation from auto-regressive generation across multiple video clips. Experimental results demonstrate our framework's superior performance in fine-grained video editing and its capability to produce high-quality, temporally consistent outputs.
