StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation
May 2, 2024
Authors: Yupeng Zhou, Daquan Zhou, Ming-Ming Cheng, Jiashi Feng, Qibin Hou
cs.AI
Abstract
For recent diffusion-based generative models, maintaining consistent content
across a series of generated images, especially those containing subjects and
complex details, presents a significant challenge. In this paper, we propose a
new way of self-attention calculation, termed Consistent Self-Attention, that
significantly boosts the consistency between the generated images and augments
prevalent pretrained diffusion-based text-to-image models in a zero-shot
manner. To extend our method to long-range video generation, we further
introduce a novel semantic space temporal motion prediction module, named
Semantic Motion Predictor. It is trained to estimate the motion conditions
between two provided images in semantic space. This module converts the
generated sequence of images into videos with smooth transitions and consistent
subjects that are significantly more stable than the modules based on latent
spaces only, especially in the context of long video generation. By merging
these two novel components, our framework, referred to as StoryDiffusion, can
describe a text-based story with consistent images or videos encompassing a
rich variety of content. StoryDiffusion offers a pioneering exploration of
visual story generation through both images and videos, which we hope will
inspire further research on architectural modifications. Our code is made
publicly available at
https://github.com/HVision-NKU/StoryDiffusion.
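The abstract describes Consistent Self-Attention as a modified self-attention computation that boosts consistency across a batch of generated images without retraining. A minimal sketch of this idea, assuming a single-head, batch-first layout: each image's tokens attend not only to themselves but also to reference tokens sampled from the other images in the batch, so shared subjects stay consistent. The tensor shapes, sampling rate, and single-head simplification are illustrative assumptions, not the paper's exact design.

```python
# Hedged sketch of Consistent Self-Attention: keys/values are augmented
# with tokens sampled from every image in the batch, so each image also
# attends to the others' content. Shapes and sampling are assumptions.
import torch
import torch.nn.functional as F

def consistent_self_attention(x, w_q, w_k, w_v, sample_rate=0.5):
    """x: (B, N, C) token features for a batch of B generated images."""
    B, N, C = x.shape
    q = x @ w_q                                  # queries stay per-image

    # Sample a subset of tokens from each image and share them batch-wide.
    n_ref = int(N * sample_rate)
    idx = torch.randperm(N)[:n_ref]
    ref = x[:, idx, :].reshape(1, B * n_ref, C).expand(B, -1, -1)

    kv_in = torch.cat([x, ref], dim=1)           # (B, N + B*n_ref, C)
    k, v = kv_in @ w_k, kv_in @ w_v

    attn = F.softmax(q @ k.transpose(-2, -1) / C ** 0.5, dim=-1)
    return attn @ v                              # (B, N, C), same shape as input
```

Because the output shape matches ordinary self-attention, such a computation can in principle be swapped into a pretrained text-to-image model's attention layers at inference time, which is what makes the zero-shot claim plausible.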
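The Semantic Motion Predictor is described only at a high level: a trained module that estimates motion conditions between two images in a semantic space rather than a latent space. One way to sketch that idea, under assumptions not taken from the paper (the embedding dimension, frame count, transformer size, and linear-interpolation initialization are all hypothetical), is to interpolate between the two keyframe embeddings and let a small transformer refine the trajectory:

```python
# Hedged sketch of a semantic-space motion predictor: given semantic
# embeddings of a start and end keyframe, predict embeddings for the
# intermediate frames. Architecture details here are assumptions.
import torch
import torch.nn as nn

class SemanticMotionPredictor(nn.Module):
    def __init__(self, dim=256, n_frames=8, n_layers=2):
        super().__init__()
        self.n_frames = n_frames
        self.pos = nn.Parameter(torch.zeros(n_frames, dim))  # per-frame position embedding
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.refiner = nn.TransformerEncoder(layer, n_layers)

    def forward(self, start_emb, end_emb):
        """start_emb, end_emb: (B, dim) semantic embeddings of two keyframes.
        Returns (B, n_frames, dim) predicted intermediate-frame embeddings."""
        t = torch.linspace(0, 1, self.n_frames, device=start_emb.device)
        # Initialize with linear interpolation in the semantic space...
        interp = (1 - t)[None, :, None] * start_emb[:, None] \
               + t[None, :, None] * end_emb[:, None]
        # ...then refine it into a (possibly non-linear) motion trajectory.
        return self.refiner(interp + self.pos)
```

The abstract's claim is that predicting this trajectory in a semantic space yields more stable long videos than operating in the latent space alone; in a full pipeline the predicted embeddings would condition a video decoder, which this sketch deliberately omits.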