PEEKABOO: Interactive Video Generation via Masked-Diffusion

Abstract

Text-to-video generation has progressed rapidly in recent years, and state-of-the-art models can now generate high-quality, realistic videos. However, these models do not let users interactively control the generated video, a capability that could unlock new areas of application. As a first step towards this goal, we tackle the problem of endowing diffusion-based video generation models with interactive spatio-temporal control over their output. To this end, we take inspiration from recent advances in the segmentation literature and propose Peekaboo, a novel spatio-temporal masked attention module. The module can be added to off-the-shelf video generation models, requires no training, adds no inference overhead, and enables spatio-temporal control. We also propose an evaluation benchmark for the interactive video generation task. Through extensive qualitative and quantitative evaluation, we establish that Peekaboo enables controllable video generation and obtains a gain of up to 3.8x in mIoU over baseline models.
