TokenFlow: Consistent Diffusion Features for Consistent Video Editing

Abstract

The generative AI revolution has recently expanded to video. Nevertheless, current state-of-the-art video models still lag behind image models in visual quality and in user control over the generated content. In this work, we present a framework that harnesses the power of a text-to-image diffusion model for the task of text-driven video editing. Specifically, given a source video and a target text prompt, our method generates a high-quality video that adheres to the target text while preserving the spatial layout and motion of the input video. Our method is based on the key observation that consistency in the edited video can be obtained by enforcing consistency in the diffusion feature space. We achieve this by explicitly propagating diffusion features based on inter-frame correspondences, which are readily available in the model. Thus, our framework does not require any training or fine-tuning, and can work in conjunction with any off-the-shelf text-to-image editing method. We demonstrate state-of-the-art editing results on a variety of real-world videos. Webpage: https://diffusion-tokenflow.github.io/
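
The abstract does not spell out how features are propagated along inter-frame correspondences. As a rough illustration only, the sketch below assumes that correspondences are nearest neighbours under cosine similarity between the source features of a frame and those of a keyframe, and that each frame token then takes the edited feature of its matched keyframe token. The function names (`cosine`, `nearest_neighbors`, `propagate`) and the toy vectors are hypothetical and are not taken from the authors' code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def nearest_neighbors(frame_tokens, key_tokens):
    """For each frame token, return the index of the most similar keyframe token."""
    return [
        max(range(len(key_tokens)), key=lambda j: cosine(tok, key_tokens[j]))
        for tok in frame_tokens
    ]


def propagate(frame_tokens, key_tokens, edited_key_tokens):
    """Give each frame token the edited feature of its matched keyframe token.

    Correspondences are computed on the SOURCE features (assumption), so the
    edited features follow the motion of the original video.
    """
    nn = nearest_neighbors(frame_tokens, key_tokens)
    return [list(edited_key_tokens[j]) for j in nn]


# Toy data (made up): two source keyframe tokens and their edited versions.
key_tokens = [[1.0, 0.0], [0.0, 1.0]]
edited_key_tokens = [[5.0, 5.0], [9.0, 9.0]]
# A neighbouring frame in which the two tokens have swapped positions.
frame_tokens = [[0.1, 0.9], [0.8, 0.2]]

propagated = propagate(frame_tokens, key_tokens, edited_key_tokens)
```

In this toy case the first frame token matches the second keyframe token and vice versa, so the edited features are carried over in swapped order. A real pipeline would operate on feature maps taken from inside the diffusion model rather than on hand-written vectors.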
