

PEEKABOO: Interactive Video Generation via Masked-Diffusion

December 12, 2023
Authors: Yash Jain, Anshul Nasery, Vibhav Vineet, Harkirat Behl
cs.AI

Abstract

Recently there has been a lot of progress in text-to-video generation, with state-of-the-art models capable of generating high-quality, realistic videos. However, these models lack the capability for users to interactively control and generate videos, which could unlock new areas of application. As a first step towards this goal, we tackle the problem of endowing diffusion-based video generation models with interactive spatio-temporal control over their output. To this end, we take inspiration from recent advances in the segmentation literature to propose a novel spatio-temporal masked attention module - Peekaboo. This module is a training-free, no-inference-overhead addition to off-the-shelf video generation models which enables spatio-temporal control. We also propose an evaluation benchmark for the interactive video generation task. Through extensive qualitative and quantitative evaluation, we establish that Peekaboo enables controlled video generation and even obtains a gain of up to 3.8x in mIoU over baseline models.
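The abstract does not detail the module's internals, but the core idea of spatially masked attention can be illustrated. The following is a hypothetical sketch (not the authors' code): attention logits between foreground and background token positions are suppressed with a user-supplied mask, so each region attends only within itself and the generated subject stays localized. All function and variable names here are illustrative assumptions.

```python
import numpy as np

def masked_attention(q, k, v, fg_mask):
    """Illustrative masked attention, not the Peekaboo implementation.

    q, k, v: (n, d) token embeddings; fg_mask: (n,) bool foreground flags.
    Logits between foreground and background tokens are set to -inf, so
    foreground outputs depend only on foreground tokens (and vice versa).
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)                   # (n, n) attention scores
    allowed = fg_mask[:, None] == fg_mask[None, :]  # same-region pairs only
    logits = np.where(allowed, logits, -np.inf)     # mask cross-region pairs
    # Numerically stable softmax over each row (diagonal is always allowed,
    # so every row has at least one finite logit).
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
n, d = 6, 4
q, k, v = rng.normal(size=(3, n, d))
fg = np.array([True, True, False, False, False, False])
out = masked_attention(q, k, v, fg)
```

With this masking, perturbing background values leaves the foreground outputs unchanged, which is the localization property interactive spatial control relies on. The paper's module applies the analogous idea across both space and time inside a diffusion model's attention layers.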