StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation

Abstract

Recent diffusion-based generative models struggle to maintain consistent content across a series of generated images, especially when the images contain subjects and complex details. In this paper, we propose a new method of computing self-attention, termed Consistent Self-Attention, that significantly boosts the consistency between generated images and augments prevalent pretrained diffusion-based text-to-image models in a zero-shot manner. To extend our method to long-range video generation, we further introduce a novel temporal motion prediction module that operates in semantic space, named Semantic Motion Predictor. It is trained to estimate the motion conditions between two provided images in semantic space. This module converts the generated sequence of images into videos with smooth transitions and consistent subjects, and these videos are significantly more stable than those produced by modules based on latent spaces only, especially in long video generation. By merging these two novel components, our framework, referred to as StoryDiffusion, can describe a text-based story with consistent images or videos encompassing a rich variety of content. StoryDiffusion is a pioneering exploration of visual story generation presented through both images and videos, which we hope will inspire more research on architectural modifications. Our code is publicly available at https://github.com/HVision-NKU/StoryDiffusion.
