Pix2Gif: Motion-Guided Diffusion for GIF Generation
March 7, 2024
Authors: Hitesh Kandala, Jianfeng Gao, Jianwei Yang
cs.AI
Abstract
We present Pix2Gif, a motion-guided diffusion model for image-to-GIF (video)
generation. We tackle this problem differently by formulating the task as an
image translation problem steered by text and motion magnitude prompts, as
shown in the teaser figure. To ensure that the model adheres to motion guidance, we
propose a new motion-guided warping module to spatially transform the features
of the source image conditioned on the two types of prompts. Furthermore, we
introduce a perceptual loss to ensure the transformed feature map remains
within the same space as the target image, ensuring content consistency and
coherence. In preparation for the model training, we meticulously curated data
by extracting coherent image frames from the TGIF video-caption dataset, which
provides rich information about the temporal changes of subjects. After
pretraining, we apply our model in a zero-shot manner to a number of video
datasets. Extensive qualitative and quantitative experiments demonstrate the
effectiveness of our model -- it not only captures the semantic prompt from
text but also the spatial ones from motion guidance. We train all our models
using a single node of 16xV100 GPUs. Code, dataset and models are made public
at: https://hiteshk03.github.io/Pix2Gif/.
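
The abstract describes two technical ingredients: a motion-guided warping module that spatially transforms source-image features conditioned on the text and motion-magnitude prompts, and a perceptual loss that keeps the warped features in the same space as the target frame. The sketch below is a minimal, illustrative PyTorch rendering of that idea; the class `MotionGuidedWarp`, its layer sizes, the flow-based conditioning scheme, and the L1 stand-in for the perceptual loss are assumptions for illustration, not the authors' released implementation (see the project page for the actual code).

```python
# Illustrative sketch only: a hypothetical motion-guided warping module in the
# spirit of the abstract. Names, shapes, and the conditioning scheme are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionGuidedWarp(nn.Module):
    """Predict a 2D flow field from source features conditioned on a text
    embedding and a scalar motion magnitude, then warp the features."""

    def __init__(self, feat_dim: int = 64, text_dim: int = 768):
        super().__init__()
        # Project the two prompts (text + motion magnitude) into one conditioning vector.
        self.cond_proj = nn.Linear(text_dim + 1, feat_dim)
        # Small conv head that outputs a per-pixel (dx, dy) flow offset.
        self.flow_head = nn.Sequential(
            nn.Conv2d(feat_dim * 2, feat_dim, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(feat_dim, 2, 3, padding=1),
        )

    def forward(self, src_feat, text_emb, motion_mag):
        # src_feat: (B, C, H, W); text_emb: (B, text_dim); motion_mag: (B, 1)
        B, C, H, W = src_feat.shape
        cond = self.cond_proj(torch.cat([text_emb, motion_mag], dim=-1))  # (B, C)
        cond = cond[:, :, None, None].expand(-1, -1, H, W)                # broadcast over space
        flow = self.flow_head(torch.cat([src_feat, cond], dim=1))         # (B, 2, H, W)

        # Build a normalized sampling grid in [-1, 1] and add the predicted offsets.
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, H, device=src_feat.device),
            torch.linspace(-1, 1, W, device=src_feat.device),
            indexing="ij",
        )
        base_grid = torch.stack([xs, ys], dim=-1).unsqueeze(0).expand(B, -1, -1, -1)
        grid = base_grid + flow.permute(0, 2, 3, 1)
        # Spatially transform (warp) the source features toward the target frame.
        return F.grid_sample(src_feat, grid, align_corners=True)


def perceptual_feature_loss(warped_feat, target_feat):
    # Keeps the warped feature map close to the target frame's features.
    # L1 is used here as a simple stand-in; the paper's exact perceptual loss may differ.
    return F.l1_loss(warped_feat, target_feat)
```

In this reading, the motion magnitude acts as an explicit knob on how far pixels move between the source frame and the generated frame, while the perceptual-style loss constrains the warped features to remain plausible target-frame content rather than arbitrary deformations.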