StreamMultiDiffusion: Real-Time Interactive Generation with Region-Based Semantic Control

Abstract

The enormous success of diffusion models in text-to-image synthesis has made them promising candidates for the next generation of end-user applications for image generation and editing. Previous work has focused on improving the usability of diffusion models either by reducing inference time or by increasing user interactivity through new, fine-grained controls such as region-based text prompts. However, we find empirically that integrating these two lines of work is nontrivial, which limits the potential of diffusion models. To resolve this incompatibility, we present StreamMultiDiffusion, the first real-time region-based text-to-image generation framework. By stabilizing fast inference techniques and restructuring the model into a newly proposed multi-prompt stream batch architecture, we achieve 10× faster panorama generation than existing solutions and a generation speed of 1.57 FPS in region-based text-to-image synthesis on a single RTX 2080 Ti GPU. Our solution opens up a new paradigm for interactive image generation named semantic palette, in which high-quality images are generated in real time from multiple hand-drawn regions, each encoding a prescribed semantic meaning (e.g., eagle, girl). Our code and demo application are available at https://github.com/ironjr/StreamMultiDiffusion.
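The abstract does not spell out how several region prompts are combined into one image. As a rough illustration only, one common way to describe region-based generation is a mask-weighted average of per-prompt predictions at each pixel. The sketch below assumes that scheme. The function `blend_regions`, the `eps` parameter, and the toy 1x3 grids are hypothetical names and data chosen for this example; they are not taken from the StreamMultiDiffusion code.

```python
from typing import List

Grid = List[List[float]]


def blend_regions(predictions: List[Grid], masks: List[Grid], eps: float = 1e-8) -> Grid:
    """Mask-weighted average of per-prompt predictions at each pixel.

    predictions[i][y][x] is a (hypothetical) value produced under prompt i;
    masks[i][y][x] in [0, 1] says how strongly prompt i owns that pixel.
    """
    height, width = len(masks[0]), len(masks[0][0])
    blended = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total_weight = sum(mask[y][x] for mask in masks)
            weighted_sum = sum(
                pred[y][x] * mask[y][x] for pred, mask in zip(predictions, masks)
            )
            # eps avoids division by zero where no region covers the pixel.
            blended[y][x] = weighted_sum / (total_weight + eps)
    return blended


# Toy 1x3 "image": the left pixel belongs to prompt 0 ("eagle"), the right
# pixel to prompt 1 ("girl"), and the middle pixel is shared equally.
eagle_pred = [[1.0, 1.0, 1.0]]
girl_pred = [[3.0, 3.0, 3.0]]
eagle_mask = [[1.0, 0.5, 0.0]]
girl_mask = [[0.0, 0.5, 1.0]]

result = blend_regions([eagle_pred, girl_pred], [eagle_mask, girl_mask])
```

In this toy case each pixel ends up close to the prediction of whichever prompt owns it, and the shared middle pixel lands halfway between the two. A real system would apply this kind of blending to latent tensors inside the denoising loop rather than to plain lists.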
