StreamMultiDiffusion: Real-Time Interactive Generation with Region-Based Semantic Control
March 14, 2024
Authors: Jaerin Lee, Daniel Sungho Jung, Kanggeon Lee, Kyoung Mu Lee
cs.AI
Abstract
The enormous success of diffusion models in text-to-image synthesis has made
them promising candidates for the next generation of end-user applications for
image generation and editing. Previous works have focused on improving the
usability of diffusion models by reducing the inference time or increasing user
interactivity by allowing new, fine-grained controls such as region-based text
prompts. However, we empirically find that integrating both branches of work
is nontrivial, limiting the potential of diffusion models. To solve this
incompatibility, we present StreamMultiDiffusion, the first real-time
region-based text-to-image generation framework. By stabilizing fast inference
techniques and restructuring the model into a newly proposed multi-prompt
stream batch architecture, we achieve panorama generation 10× faster than
existing solutions and a generation speed of 1.57 FPS for region-based
text-to-image synthesis on a single RTX 2080 Ti GPU. Our solution opens up a
new paradigm for interactive image generation named semantic palette, where
high-quality images are generated in real time from multiple hand-drawn
regions, each encoding a prescribed semantic meaning (e.g., eagle, girl). Our code
and demo application are available at
https://github.com/ironjr/StreamMultiDiffusion.
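As background, the region-based control the abstract refers to follows the MultiDiffusion idea of blending several denoising processes through user-given masks. Below is a minimal conceptual sketch of one such denoising step, assuming a diffusers-style UNet and scheduler; it illustrates the general technique, not the repository's actual API, and the function name and mask-normalization assumption are ours.

```python
import torch

@torch.no_grad()
def region_based_step(unet, scheduler, latents, t, prompt_embeds, masks):
    """One denoising step with region-based text prompts (conceptual sketch).

    latents:       (1, C, H, W) current noisy latents
    prompt_embeds: (P, L, D) text embeddings, one per regional prompt
    masks:         (P, 1, H, W) soft region masks, assumed normalized so
                   they sum to 1 at every spatial location
    """
    num_prompts = prompt_embeds.shape[0]
    # Predict noise for all regional prompts in a single batched UNet call.
    noise_pred = unet(
        latents.expand(num_prompts, -1, -1, -1), t,
        encoder_hidden_states=prompt_embeds,
    ).sample
    # Blend the per-prompt predictions, weighting each by its region mask.
    blended = (noise_pred * masks).sum(dim=0, keepdim=True)
    # Advance the shared latents one step using the blended prediction.
    return scheduler.step(blended, t, latents).prev_sample
```

The multi-prompt stream batch architecture named in the abstract then pipelines such steps across frames: the UNet batch holds several images at staggered denoising timesteps, so each call advances every in-flight image and completes one of them, which is what supports the reported real-time frame rates.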