Cocktail: Mixing Multi-Modality Controls for Text-Conditional Image Generation

Abstract

Text-conditional diffusion models can generate high-fidelity images with diverse content. However, a text prompt often describes the intended image ambiguously, so additional control signals are needed to make text-guided diffusion models more effective. In this work, we propose Cocktail, a pipeline that mixes various modalities into one embedding, combined with a generalized ControlNet (gControlNet), a controllable normalization (ControlNorm), and a spatial guidance sampling method, to achieve multi-modal and spatially refined control for text-conditional diffusion models. Specifically, we introduce a hyper-network, gControlNet, dedicated to aligning control signals from disparate modalities and infusing them into the pre-trained diffusion model. gControlNet accepts flexible modality signals: it can receive any combination of modality signals simultaneously, or fuse multiple modality signals as complements to one another. The control signals are then fused and injected into the backbone model according to our proposed ControlNorm. Furthermore, our spatial guidance sampling method incorporates the control signal into the designated region, thereby avoiding undesired objects in the generated image. We demonstrate the results of our method in controlling various modalities, showing high-quality synthesis and fidelity to multiple external signals.
