StreamMultiDiffusion: Real-Time Interactive Generation with Region-Based Semantic Control

Abstract

The enormous success of diffusion models in text-to-image synthesis has made them promising candidates for the next generation of end-user applications for image generation and editing. Previous works have focused on improving the usability of diffusion models either by reducing inference time or by increasing user interactivity through new, fine-grained controls such as region-based text prompts. However, we empirically find that integrating these two lines of work is nontrivial, which limits the potential of diffusion models. To resolve this incompatibility, we present StreamMultiDiffusion, the first real-time region-based text-to-image generation framework. By stabilizing fast inference techniques and restructuring the model into a newly proposed multi-prompt stream batch architecture, we achieve 10× faster panorama generation than existing solutions and a generation speed of 1.57 FPS in region-based text-to-image synthesis on a single RTX 2080 Ti GPU. Our solution opens up a new paradigm for interactive image generation named semantic palette, in which high-quality images are generated in real time from multiple hand-drawn regions, each encoding a prescribed semantic meaning (e.g., eagle, girl). Our code and demo application are available at https://github.com/ironjr/StreamMultiDiffusion.
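
One way to picture region-based text prompting is as a per-pixel, mask-weighted average of the predictions produced for each prompt. The sketch below is an illustration under that assumption only, not the StreamMultiDiffusion implementation: the function `blend_regions`, the 2×2 grids standing in for latent predictions, and the prompt names used as keys are all hypothetical.

```python
from typing import Dict, List

Grid = List[List[float]]


def blend_regions(predictions: Dict[str, Grid],
                  masks: Dict[str, Grid],
                  eps: float = 1e-8) -> Grid:
    """Mask-weighted average of per-prompt predictions at every pixel.

    Hypothetical helper for illustration; real systems operate on
    multi-channel latent tensors rather than small float grids.
    """
    names = list(predictions)
    height = len(predictions[names[0]])
    width = len(predictions[names[0]][0])
    blended = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total_weight = sum(masks[n][y][x] for n in names)
            weighted = sum(masks[n][y][x] * predictions[n][y][x] for n in names)
            # Pixels not covered by any region mask are left at 0.0.
            if total_weight > eps:
                blended[y][x] = weighted / total_weight
    return blended


# Toy stand-ins for two per-prompt predictions (assumed values).
predictions = {
    "eagle": [[1.0, 1.0], [1.0, 1.0]],
    "girl": [[3.0, 3.0], [3.0, 3.0]],
}
# Hand-drawn regions as binary masks; the top-right pixel is shared.
masks = {
    "eagle": [[1.0, 1.0], [0.0, 0.0]],
    "girl": [[0.0, 1.0], [1.0, 0.0]],
}
result = blend_regions(predictions, masks)
```

In this toy case, pixels covered by a single region take that prompt's value, while the overlapping pixel receives the average of both.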
