AnyV2V: A Plug-and-Play Framework For Any Video-to-Video Editing Tasks

Abstract

Video-to-video editing takes a source video together with an additional control (such as a text prompt, a subject, or a style) and generates a new video that aligns with both the source video and the provided control. Previous methods have been constrained to particular editing types, which limits their ability to meet the wide range of user demands. In this paper, we introduce AnyV2V, a novel training-free framework that simplifies video editing into two primary steps: (1) employing an off-the-shelf image editing model (e.g., InstructPix2Pix or InstantID) to modify the first frame, and (2) utilizing an existing image-to-video generation model (e.g., I2VGen-XL) for DDIM inversion and feature injection. In the first step, AnyV2V can plug in any existing image editing tool to support an extensive array of video editing tasks. Beyond traditional prompt-based editing, AnyV2V can also support novel video editing tasks, including reference-based style transfer, subject-driven editing, and identity manipulation, which were unattainable with previous methods. In the second step, AnyV2V can plug in any existing image-to-video model to perform DDIM inversion and intermediate feature injection, which maintains appearance and motion consistency with the source video. On prompt-based editing, we show that AnyV2V can outperform the previous best approach by 35% on prompt alignment and by 25% on human preference. On the three novel tasks, we show that AnyV2V also achieves a high success rate. We believe AnyV2V will continue to thrive due to its ability to seamlessly integrate fast-evolving image editing methods. Such compatibility can help AnyV2V increase its versatility and cater to diverse user demands.
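The DDIM inversion mentioned in the second step can be illustrated with a small numerical sketch. This is not the AnyV2V implementation: the noise predictor below is a hypothetical toy function standing in for an image-to-video model, the noise schedule and all function names are assumptions made for illustration, and feature injection is not shown. The sketch only demonstrates the general idea of deterministically mapping a clean latent to a noisy one and then sampling back from it.

```python
import numpy as np


def ddim_step(x, eps, alpha_from, alpha_to):
    """Move a latent between two noise levels with the deterministic DDIM update."""
    # Estimate the clean sample implied by the current latent and noise guess.
    x0_pred = (x - np.sqrt(1.0 - alpha_from) * eps) / np.sqrt(alpha_from)
    # Re-noise that estimate to the target level using the same noise guess.
    return np.sqrt(alpha_to) * x0_pred + np.sqrt(1.0 - alpha_to) * eps


def ddim_invert(x0, eps_model, alphas):
    """Walk from the clean level alphas[0] to the noisiest level alphas[-1]."""
    x = x0
    for i in range(len(alphas) - 1):
        eps = eps_model(x, i)
        x = ddim_step(x, eps, alphas[i], alphas[i + 1])
    return x


def ddim_sample(x_noisy, eps_model, alphas):
    """Walk back from the noisiest level alphas[-1] to the clean level alphas[0]."""
    x = x_noisy
    for i in range(len(alphas) - 1, 0, -1):
        eps = eps_model(x, i)
        x = ddim_step(x, eps, alphas[i], alphas[i - 1])
    return x


def toy_eps_model(x, step):
    """Hypothetical stand-in for a trained noise predictor (assumption)."""
    return 0.1 * np.tanh(x)


# Assumed schedule: cumulative signal level, from clean (1.0) to mostly noise.
alphas = np.linspace(1.0, 0.05, 50)

rng = np.random.default_rng(0)
latent = rng.standard_normal((4, 8))  # e.g. 4 frames with 8 features each

noisy = ddim_invert(latent, toy_eps_model, alphas)
recon = ddim_sample(noisy, toy_eps_model, alphas)

# With a predictor that depends on x, the round trip is only approximate.
error = float(np.max(np.abs(recon - latent)))
print("max reconstruction error:", error)
```

Because the inverted latent is a deterministic function of the source, sampling from it with the same model approximately reproduces the source. This is why an inverted latent is a natural starting point when the edited output should stay close to the source video's structure and motion.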
