

MaskINT: Video Editing via Interpolative Non-autoregressive Masked Transformers

December 19, 2023
Authors: Haoyu Ma, Shahin Mahdizadehaghdam, Bichen Wu, Zhipeng Fan, Yuchao Gu, Wenliang Zhao, Lior Shapira, Xiaohui Xie
cs.AI

Abstract

Recent advances in generative AI have significantly enhanced image and video editing, particularly in the context of text prompt control. State-of-the-art approaches predominantly rely on diffusion models to accomplish these tasks. However, the computational demands of diffusion-based methods are substantial, and they often necessitate large-scale paired datasets for training, which makes deployment in practical applications challenging. This study addresses this challenge by breaking down the text-based video editing process into two separate stages. In the first stage, we leverage an existing text-to-image diffusion model to simultaneously edit a few keyframes without additional fine-tuning. In the second stage, we introduce an efficient model called MaskINT, which is built on non-autoregressive masked generative transformers and specializes in frame interpolation between the keyframes, benefiting from structural guidance provided by intermediate frames. Our comprehensive set of experiments illustrates the efficacy and efficiency of MaskINT when compared to other diffusion-based methodologies. This research offers a practical solution for text-based video editing and showcases the potential of non-autoregressive masked generative transformers in this domain.
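To make the second stage more concrete, the sketch below illustrates the general idea of non-autoregressive masked parallel decoding (in the style of MaskGIT) conditioned on the tokens of two edited keyframes. It is a minimal, hypothetical example: the function names, arguments, cosine masking schedule, and the predict_logits stand-in are assumptions for illustration, not MaskINT's actual implementation.

# Minimal, hypothetical sketch of MaskGIT-style non-autoregressive parallel
# decoding for keyframe interpolation. predict_logits, mask_id, and the cosine
# masking schedule are illustrative assumptions, not MaskINT's actual code.
import math
import torch

def interpolate_frames(predict_logits, keyframe_tokens, num_mid_frames,
                       tokens_per_frame, mask_id, num_steps=8):
    """Fill in the masked tokens of intermediate frames over a few parallel steps.

    predict_logits: callable(token_ids) -> per-token logits over the VQ codebook
                    (stand-in for the masked generative transformer).
    keyframe_tokens: (2, tokens_per_frame) token ids of the two edited keyframes.
    """
    device = keyframe_tokens.device
    n_mid = num_mid_frames * tokens_per_frame
    # Start with every intermediate-frame token masked.
    mid = torch.full((n_mid,), mask_id, dtype=torch.long, device=device)

    for step in range(num_steps):
        still_masked = mid == mask_id
        if not still_masked.any():
            break

        # Condition on the first keyframe, the partially filled middle frames,
        # and the last keyframe, then predict every middle token in parallel.
        tokens = torch.cat([keyframe_tokens[0], mid, keyframe_tokens[1]])
        logits = predict_logits(tokens)[tokens_per_frame:tokens_per_frame + n_mid]
        conf, pred = logits.softmax(-1).max(-1)

        # Cosine schedule: the fraction of tokens left masked shrinks each step.
        keep_masked = int(math.cos(math.pi / 2 * (step + 1) / num_steps) * n_mid)

        # Commit only the most confident predictions among still-masked positions.
        conf = conf.masked_fill(~still_masked, -1.0)
        n_reveal = max(int(still_masked.sum()) - keep_masked, 1)
        reveal = torch.topk(conf, k=n_reveal).indices
        mid[reveal] = pred[reveal]

    return mid.view(num_mid_frames, tokens_per_frame)

Because all intermediate-frame tokens are predicted in parallel and only a handful of refinement steps are needed, this kind of decoding is far cheaper than running a full diffusion sampling chain for every frame, which is the efficiency argument the abstract makes.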