MaskINT: Video Editing via Interpolative Non-autoregressive Masked Transformers
December 19, 2023
Authors: Haoyu Ma, Shahin Mahdizadehaghdam, Bichen Wu, Zhipeng Fan, Yuchao Gu, Wenliang Zhao, Lior Shapira, Xiaohui Xie
cs.AI
Abstract
Recent advances in generative AI have significantly enhanced image and video
editing, particularly in the context of text prompt control. State-of-the-art
approaches predominantly rely on diffusion models to accomplish these tasks.
However, the computational demands of diffusion-based methods are substantial,
often necessitating large-scale paired datasets for training, which makes them
challenging to deploy in practical applications. This study addresses this
challenge by breaking down the text-based video editing process into two
separate stages. In the first stage, we leverage an existing text-to-image
diffusion model to simultaneously edit a few keyframes without additional
fine-tuning. In the second stage, we introduce an efficient model called
MaskINT, which is built on non-autoregressive masked generative transformers
and specializes in frame interpolation between the keyframes, benefiting from
structural guidance provided by intermediate frames. Our comprehensive set of
experiments illustrates the efficacy and efficiency of MaskINT when compared to
other diffusion-based methodologies. This research offers a practical solution
for text-based video editing and showcases the potential of non-autoregressive
masked generative transformers in this domain.
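Non-autoregressive masked generative transformers (in the MaskGIT family that MaskINT builds on) typically fill in all token positions in parallel over a few refinement steps: start fully masked, predict every token, keep the most confident predictions, and re-mask the rest on a decaying schedule. The sketch below illustrates this decoding loop only; the `dummy_predictor`, the cosine schedule, and all names are illustrative stand-ins, not MaskINT's actual model or API.

```python
import math
import random

MASK = -1  # sentinel id for a masked token position

def dummy_predictor(tokens):
    """Stand-in for the transformer: returns a (token_id, confidence) pair
    per position. The real model would condition on the edited keyframes
    and the structural guidance from intermediate frames."""
    return [(random.randrange(1024), random.random()) for _ in tokens]

def iterative_decode(num_tokens, num_steps=8):
    """MaskGIT-style non-autoregressive decoding: begin fully masked and,
    at each step, unmask the most confident predictions, following a
    cosine schedule for how many positions stay masked."""
    tokens = [MASK] * num_tokens
    for step in range(num_steps):
        preds = dummy_predictor(tokens)
        # Cosine schedule: fraction of positions still masked after this step
        # (reaches 0 at the final step, so every position gets filled).
        frac = math.cos(math.pi / 2 * (step + 1) / num_steps)
        keep_masked = int(num_tokens * frac)
        # Rank masked positions by confidence; reveal all but the least
        # confident `keep_masked` positions.
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[: len(masked) - keep_masked]:
            tokens[i] = preds[i][0]
    return tokens

frames_tokens = iterative_decode(16)
```

Because each step predicts all positions at once, the number of model forward passes is the (small, fixed) number of refinement steps rather than the sequence length, which is the source of the efficiency gain over autoregressive or many-step diffusion decoding.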