Video Editing via Factorized Diffusion Distillation

Abstract

We introduce Emu Video Edit (EVE), a model that establishes a new state of the art in video editing without relying on any supervised video editing data. To develop EVE, we separately train an image editing adapter and a video generation adapter, and attach both to the same text-to-image model. Then, to align the adapters towards video editing, we introduce a new unsupervised distillation procedure, Factorized Diffusion Distillation. This procedure distills knowledge from one or more teachers simultaneously, without any supervised data. We utilize this procedure to teach EVE to edit videos by jointly distilling knowledge to (i) precisely edit each individual frame from the image editing adapter, and (ii) ensure temporal consistency among the edited frames using the video generation adapter. Finally, to demonstrate the potential of our approach in unlocking other capabilities, we align additional combinations of adapters.
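As a rough illustration of what distilling from two teachers at once can look like, the toy sketch below trains a single student against a per-frame "editing" signal and a "temporal consistency" signal together. Everything in it is an assumption made for illustration: the abstract does not specify the losses or the architecture, so plain squared-error terms stand in for them, each frame is reduced to one scalar, the "student" is just the list of output frames rather than a diffusion model, and the names frame_teacher, temporal_grad, and distill are hypothetical.

```python
# Toy sketch: one student learns from two teachers at once.
# All names and losses are illustrative assumptions, not the paper's method.


def frame_teacher(frames):
    """Stand-in for an image editing teacher that edits each frame on its own.

    The edit (doubling the value) is right on average, but the alternating
    offset mimics flicker caused by editing frames independently.
    """
    return [2.0 * x + (0.3 if t % 2 == 0 else -0.3) for t, x in enumerate(frames)]


def temporal_grad(video):
    """Stand-in for a video teacher: gradient of sum_t (v[t+1] - v[t])**2."""
    grad = [0.0] * len(video)
    for t in range(len(video) - 1):
        diff = video[t + 1] - video[t]
        grad[t] -= 2.0 * diff
        grad[t + 1] += 2.0 * diff
    return grad


def distill(frames, w_frame=1.0, w_temporal=1.0, lr=0.05, steps=500):
    """Gradient descent on a weighted sum of both teachers' losses."""
    targets = frame_teacher(frames)
    student = list(frames)  # the "student" is simply the output frames here
    for _ in range(steps):
        g_temporal = temporal_grad(student)
        student = [
            s - lr * (w_frame * 2.0 * (s - e) + w_temporal * g)
            for s, e, g in zip(student, targets, g_temporal)
        ]
    return student


def spread(values):
    """Difference between the largest and smallest frame value."""
    return max(values) - min(values)


frames = [1.0] * 6  # a static six-frame "video", one scalar per frame
teacher_out = frame_teacher(frames)
student_out = distill(frames)
```

In this toy setting the student keeps the average edit suggested by the frame teacher while its frame-to-frame variation shrinks relative to the flickering per-frame targets, which is the kind of trade-off that combining the two signals is meant to achieve.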
