StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation

Abstract

For recent diffusion-based generative models, maintaining consistent content across a series of generated images, especially images that contain subjects and complex details, remains a significant challenge. In this paper, we propose a new way of computing self-attention, termed Consistent Self-Attention, that significantly improves the consistency between generated images and augments prevalent pretrained diffusion-based text-to-image models in a zero-shot manner. To extend our method to long-range video generation, we further introduce a novel temporal motion prediction module that operates in semantic space, named Semantic Motion Predictor. It is trained to estimate the motion conditions between two provided images in semantic space. This module converts the generated sequence of images into videos with smooth transitions and consistent subjects, and its results are significantly more stable than those of modules based on latent spaces only, especially for long video generation. By merging these two novel components, our framework, referred to as StoryDiffusion, can describe a text-based story with consistent images or videos encompassing a rich variety of content. The proposed StoryDiffusion encompasses pioneering explorations in visual story generation with images and videos, which we hope will inspire more research from the aspect of architectural modifications. Our code is publicly available at https://github.com/HVision-NKU/StoryDiffusion.
