MaskINT: Video Editing via Interpolative Non-autoregressive Masked Transformers

Abstract

Recent advances in generative AI have significantly improved image and video editing, particularly editing controlled by text prompts. State-of-the-art approaches rely predominantly on diffusion models for these tasks. However, diffusion-based methods are computationally demanding and often require large-scale paired datasets for training, which makes them difficult to deploy in practical applications. This study addresses the challenge by splitting text-based video editing into two separate stages. In the first stage, we leverage an existing text-to-image diffusion model to edit a few keyframes simultaneously, without additional fine-tuning. In the second stage, we introduce MaskINT, an efficient model built on non-autoregressive masked generative transformers that specializes in interpolating frames between the keyframes, using structural guidance provided by the intermediate frames. Our comprehensive experiments demonstrate the efficacy and efficiency of MaskINT compared with other diffusion-based methods. This research offers a practical solution for text-based video editing and shows the potential of non-autoregressive masked generative transformers in this domain.
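
To make the phrase "non-autoregressive masked generative transformer" concrete, the following is a minimal sketch of iterative parallel decoding: all tokens of the frames to be interpolated start masked, every pass predicts all positions at once, and only the most confident predictions are committed. The cosine schedule, the names `parallel_decode` and `cosine_masked_count`, and the random `toy_predictor` are assumptions for illustration only; the abstract does not specify MaskINT's schedule or interface, and a real predictor would be a transformer conditioned on the edited keyframes and the structural guidance.

```python
import math
import random

MASK = -1  # placeholder id for a token that has not been decoded yet


def cosine_masked_count(step, total_steps, num_tokens):
    """Tokens that stay masked after `step` of `total_steps` (assumed cosine schedule)."""
    ratio = math.cos(math.pi / 2 * step / total_steps)
    return max(0, int(math.floor(ratio * num_tokens)))


def toy_predictor(tokens, rng, vocab_size=16):
    """Hypothetical stand-in for the transformer: one (token id, confidence) per position."""
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def parallel_decode(num_tokens, total_steps=4, seed=0, predictor=toy_predictor):
    """Fill `num_tokens` masked positions in `total_steps` parallel passes."""
    rng = random.Random(seed)
    tokens = [MASK] * num_tokens
    for step in range(1, total_steps + 1):
        preds = predictor(tokens, rng)  # all positions are predicted in one pass
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        keep_masked = cosine_masked_count(step, total_steps, num_tokens)
        # Commit the most confident predictions; the rest stay masked for later passes.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        num_commit = max(len(masked) - keep_masked, 0)
        for i in masked[:num_commit]:
            tokens[i] = preds[i][0]
    return tokens


if __name__ == "__main__":
    print(parallel_decode(12, total_steps=4))
```

The number of passes is fixed in advance and is independent of the number of tokens, which is the source of the efficiency gain over token-by-token autoregressive decoding.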
