MaskINT: Video Editing via Interpolative Non-autoregressive Masked Transformers
December 19, 2023
Authors: Haoyu Ma, Shahin Mahdizadehaghdam, Bichen Wu, Zhipeng Fan, Yuchao Gu, Wenliang Zhao, Lior Shapira, Xiaohui Xie
cs.AI
Abstract
Recent advances in generative AI have significantly enhanced image and video
editing, particularly in the context of text prompt control. State-of-the-art
approaches predominantly rely on diffusion models to accomplish these tasks.
However, the computational demands of diffusion-based methods are substantial,
often necessitating large-scale paired datasets for training, which makes them
challenging to deploy in practical applications. This study addresses this
challenge by breaking down the text-based video editing process into two
separate stages. In the first stage, we leverage an existing text-to-image
diffusion model to simultaneously edit a few keyframes without additional
fine-tuning. In the second stage, we introduce an efficient model called
MaskINT, which is built on non-autoregressive masked generative transformers
and specializes in frame interpolation between the keyframes, benefiting from
structural guidance provided by intermediate frames. Our comprehensive set of
experiments illustrates the efficacy and efficiency of MaskINT when compared to
other diffusion-based methodologies. This research offers a practical solution
for text-based video editing and showcases the potential of non-autoregressive
masked generative transformers in this domain.
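Non-autoregressive masked generative transformers (in the MaskGIT family that MaskINT builds on) typically fill in all token positions in parallel over a few refinement steps: start fully masked, predict every token, keep the most confident predictions, and re-mask the rest on a decaying schedule. The sketch below illustrates this decoding loop only; the `dummy_predictor`, the cosine schedule, and all names are illustrative stand-ins, not MaskINT's actual model or API.

```python
import math
import random

MASK = -1  # sentinel id for a masked token position

def dummy_predictor(tokens):
    """Stand-in for the transformer: returns a (token_id, confidence) pair
    per position. The real model would condition on the edited keyframes
    and the structural guidance from intermediate frames."""
    return [(random.randrange(1024), random.random()) for _ in tokens]

def iterative_decode(num_tokens, num_steps=8):
    """MaskGIT-style non-autoregressive decoding: begin fully masked and,
    at each step, unmask the most confident predictions, following a
    cosine schedule for how many positions stay masked."""
    tokens = [MASK] * num_tokens
    for step in range(num_steps):
        preds = dummy_predictor(tokens)
        # Cosine schedule: fraction of positions still masked after this step
        # (reaches 0 at the final step, so every position gets filled).
        frac = math.cos(math.pi / 2 * (step + 1) / num_steps)
        keep_masked = int(num_tokens * frac)
        # Rank masked positions by confidence; reveal all but the least
        # confident `keep_masked` positions.
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[: len(masked) - keep_masked]:
            tokens[i] = preds[i][0]
    return tokens

frames_tokens = iterative_decode(16)
```

Because each step predicts all positions at once, the number of model forward passes is the (small, fixed) number of refinement steps rather than the sequence length, which is the source of the efficiency gain over autoregressive or many-step diffusion decoding.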