PEEKABOO: Interactive Video Generation via Masked-Diffusion

Abstract

Text-to-video generation has progressed considerably in recent years, and state-of-the-art models can now generate high-quality, realistic videos. However, these models do not let users interactively control the videos they generate, a capability that could unlock new areas of application. As a first step towards this goal, we tackle the problem of endowing diffusion-based video generation models with interactive spatio-temporal control over their output. To this end, we take inspiration from recent advances in the segmentation literature to propose a novel spatio-temporal masked attention module, Peekaboo. This module is a training-free, no-inference-overhead addition to off-the-shelf video generation models that enables spatio-temporal control. We also propose an evaluation benchmark for the interactive video generation task. Through extensive qualitative and quantitative evaluation, we establish that Peekaboo enables controllable video generation and obtains a gain of up to 3.8x in mIoU over baseline models.
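
For readers unfamiliar with the mIoU figure quoted above, the following is a minimal sketch of how a mean intersection-over-union between requested and observed object locations could be computed. It assumes that both are given as one axis-aligned bounding box per frame. The abstract does not specify the benchmark's exact protocol, so the box format and the names `box_iou` and `mean_iou` are illustrative assumptions, not the authors' evaluation code.

```python
from typing import List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixel or normalized units.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_iou(target: List[Box], observed: List[Box]) -> float:
    """Average per-frame IoU between target boxes and observed boxes."""
    if len(target) != len(observed):
        raise ValueError("target and observed must have one box per frame")
    if not target:
        return 0.0
    return sum(box_iou(t, o) for t, o in zip(target, observed)) / len(target)


if __name__ == "__main__":
    # Hypothetical two-frame example: the first frame overlaps partially,
    # the second frame matches the requested box exactly.
    target = [(0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0)]
    observed = [(1.0, 0.0, 3.0, 2.0), (1.0, 1.0, 3.0, 3.0)]
    print(f"mIoU = {mean_iou(target, observed):.3f}")
```

Under this definition, a relative gain such as "3.8x" would compare the `mean_iou` of the controlled model against that of the baseline on the same set of target boxes.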
