

StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation

May 2, 2024
Authors: Yupeng Zhou, Daquan Zhou, Ming-Ming Cheng, Jiashi Feng, Qibin Hou
cs.AI

Abstract

For recent diffusion-based generative models, maintaining consistent content across a series of generated images, especially those containing subjects and complex details, presents a significant challenge. In this paper, we propose a new way of computing self-attention, termed Consistent Self-Attention, that significantly boosts the consistency between generated images and augments prevalent pretrained diffusion-based text-to-image models in a zero-shot manner. To extend our method to long-range video generation, we further introduce a novel semantic-space temporal motion prediction module, named the Semantic Motion Predictor. It is trained to estimate the motion conditions between two provided images in the semantic space. This module converts the generated sequence of images into videos with smooth transitions and consistent subjects, and is significantly more stable than modules based only on latent spaces, especially in the context of long video generation. By merging these two novel components, our framework, referred to as StoryDiffusion, can describe a text-based story with consistent images or videos encompassing a rich variety of content. StoryDiffusion represents a pioneering exploration of visual story generation through the presentation of images and videos, which we hope can inspire more research on architectural modifications. Our code is made publicly available at https://github.com/HVision-NKU/StoryDiffusion.
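The core idea of Consistent Self-Attention, as the abstract describes it, is to let each image in a batch attend not only to its own tokens but also to tokens drawn from the other images of the same story, so subject features are shared across frames. The following is a minimal NumPy sketch of that idea, not the authors' implementation (which is available at the linked repository); the function name, the `sample_rate` parameter, and the omission of projection matrices and multi-head structure are all simplifying assumptions for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def consistent_self_attention(tokens, sample_rate=0.5, seed=0):
    """Illustrative sketch (not the paper's code).

    tokens: array of shape (B, N, D) holding the token sequences of the
    B images generated for one story. For each image, the key/value set
    is augmented with tokens randomly sampled from the other images in
    the batch, so attention can propagate subject appearance across
    frames. Learned Q/K/V projections and multiple heads are omitted.
    """
    rng = np.random.default_rng(seed)
    B, N, D = tokens.shape
    n_sample = int(N * sample_rate)
    outputs = np.empty_like(tokens)
    for i in range(B):
        # Sample a subset of tokens from every other image in the batch.
        others = [tokens[j][rng.choice(N, n_sample, replace=False)]
                  for j in range(B) if j != i]
        # Keys/values = own tokens plus the cross-image samples.
        kv = np.concatenate([tokens[i]] + others, axis=0)
        # Scaled dot-product attention over the augmented token set.
        scores = tokens[i] @ kv.T / np.sqrt(D)   # (N, N + (B-1)*n_sample)
        outputs[i] = softmax(scores, axis=-1) @ kv
    return outputs
```

Because the cross-image tokens enter only the key/value side, the output retains one vector per original token, so the module can drop into an existing attention layer without changing downstream shapes; this is what makes a zero-shot, training-free plug-in plausible.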