LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation
October 16, 2023
作者: Ruiqi Wu, Liangyu Chen, Tong Yang, Chunle Guo, Chongyi Li, Xiangyu Zhang
cs.AI
Abstract
With the impressive progress in diffusion-based text-to-image generation,
extending such powerful generative ability to text-to-video has attracted enormous
attention. Existing methods either require large-scale text-video pairs and
extensive training resources or learn motions that are precisely aligned
with template videos. It is non-trivial to balance the trade-off between the
degree of generation freedom and the resource costs of video generation. In
our study, we present a few-shot-based tuning framework, LAMP, which enables a
text-to-image diffusion model to Learn A specific Motion Pattern with 8~16 videos
on a single GPU. Specifically, we design a first-frame-conditioned pipeline
that uses an off-the-shelf text-to-image model for content generation so that
our tuned video diffusion model mainly focuses on motion learning. The
well-developed text-to-image techniques can provide visually pleasing and
diverse content as generation conditions, which greatly improves video quality
and generation freedom. To capture features along the temporal dimension, we
expand the pretrained 2D convolution layers of the T2I model to our novel
temporal-spatial motion learning layers and modify the attention blocks to the
temporal level. Additionally, we develop an effective inference trick,
shared-noise sampling, which can improve the stability of videos while reducing
computational costs. Our method can also be flexibly applied to other tasks,
e.g., real-world image animation and video editing. Extensive experiments
demonstrate that LAMP can effectively learn the motion pattern on limited data
and generate high-quality videos. The code and models are available at
https://rq-wu.github.io/projects/LAMP.
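
The abstract only names the temporal-spatial motion learning layers that replace the pretrained 2D convolutions. As a rough illustration of how such an inflation is typically done (not the paper's exact layer design), the following PyTorch sketch wraps a pretrained Conv2d with a zero-initialized 1D convolution over the frame axis, so the module initially reproduces the original T2I behavior and only gradually learns motion; the class name, kernel size, and residual combination are assumptions.

```python
import torch
import torch.nn as nn


class TemporalSpatialConv(nn.Module):
    """Illustrative inflation of a pretrained 2D conv with a temporal branch.

    The spatial path reuses the pretrained Conv2d per frame; a zero-initialized
    Conv1d over the frame axis learns motion, so the module initially behaves
    exactly like the original T2I layer (an assumption, not the paper's design).
    """

    def __init__(self, pretrained_conv2d: nn.Conv2d, kernel_t: int = 3):
        super().__init__()
        self.spatial = pretrained_conv2d
        out_ch = pretrained_conv2d.out_channels
        self.temporal = nn.Conv1d(out_ch, out_ch, kernel_t, padding=kernel_t // 2)
        nn.init.zeros_(self.temporal.weight)  # start as an identity mapping
        nn.init.zeros_(self.temporal.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, height, width)
        b, c, f, h, w = x.shape
        x = x.permute(0, 2, 1, 3, 4).reshape(b * f, c, h, w)
        x = self.spatial(x)                          # per-frame 2D convolution
        _, c2, h2, w2 = x.shape
        x = x.reshape(b, f, c2, h2, w2)
        # fold spatial positions into the batch and convolve over the frame axis
        t = x.permute(0, 3, 4, 2, 1).reshape(b * h2 * w2, c2, f)
        t = self.temporal(t)
        t = t.reshape(b, h2, w2, c2, f).permute(0, 4, 3, 1, 2)
        out = x + t                                  # residual temporal update
        return out.permute(0, 2, 1, 3, 4)            # back to (b, c, f, h, w)
```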
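Likewise, "modifying the attention blocks to the temporal level" generally means letting each spatial location attend across frames rather than across pixels. A minimal sketch, assuming a plain multi-head self-attention over the frame axis (the paper's actual block layout may differ):

```python
import torch
import torch.nn as nn


class TemporalSelfAttention(nn.Module):
    """Illustrative self-attention over the frame axis.

    Every spatial location attends across frames only; `num_heads` must divide
    `channels`. This is one standard realization, not necessarily the paper's.
    """

    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, height, width)
        b, c, f, h, w = x.shape
        # sequence axis = frames; every (b, h, w) location is its own batch item
        seq = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, f, c)
        out, _ = self.attn(seq, seq, seq)
        out = out.reshape(b, h, w, f, c).permute(0, 4, 3, 1, 2)
        return x + out  # residual connection
```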
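The shared-noise sampling trick is only named in the abstract. One plausible reading, sketched below purely as an assumption, is to let all frames share a single base noise map blended with a small per-frame component, which correlates the initial latents across frames; the function name, the blend factor `alpha`, and the mixing rule are hypothetical.

```python
import torch


def shared_noise(batch: int, channels: int, frames: int, height: int, width: int,
                 alpha: float = 0.5, device: str = "cpu") -> torch.Tensor:
    """Hypothetical sketch of shared-noise initialization for video sampling.

    All frames share one base noise map, blended with per-frame noise so the
    initial latents are temporally correlated. The blend factor and mixing rule
    are assumptions; see the paper for the actual scheme.
    """
    base = torch.randn(batch, channels, 1, height, width, device=device)
    per_frame = torch.randn(batch, channels, frames, height, width, device=device)
    noise = alpha * base + (1.0 - alpha) * per_frame
    # renormalize so the blended noise keeps approximately unit variance
    return noise / (alpha ** 2 + (1.0 - alpha) ** 2) ** 0.5
```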