Cocktail: Mixing Multi-Modality Controls for Text-Conditional Image Generation
June 1, 2023
Authors: Minghui Hu, Jianbin Zheng, Daqing Liu, Chuanxia Zheng, Chaoyue Wang, Dacheng Tao, Tat-Jen Cham
cs.AI
Abstract
Text-conditional diffusion models are able to generate high-fidelity images
with diverse contents. However, linguistic representations frequently exhibit
ambiguous descriptions of the envisioned objective imagery, requiring the
incorporation of additional control signals to bolster the efficacy of
text-guided diffusion models. In this work, we propose Cocktail, a pipeline to
mix various modalities into one embedding, amalgamated with a generalized
ControlNet (gControlNet), a controllable normalisation (ControlNorm), and a
spatial guidance sampling method, to actualize multi-modal and
spatially-refined control for text-conditional diffusion models. Specifically,
we introduce a hyper-network gControlNet, dedicated to the alignment and
infusion of the control signals from disparate modalities into the pre-trained
diffusion model. gControlNet accepts flexible modality signals: it can receive
any combination of modality signals simultaneously, or fuse multiple modality
signals as a supplementary combination. The control signals
are then fused and injected into the backbone model according to our proposed
ControlNorm. Furthermore, our advanced spatial guidance sampling methodology
proficiently incorporates the control signal into the designated region,
thereby circumventing the manifestation of undesired objects within the
generated image. We show results of our method in controlling
various modalities, demonstrating high-quality synthesis and fidelity to
multiple external signals.
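
The abstract gives no formulas for how control signals are fused and injected, so the following is only a toy sketch of the general idea: fuse whichever modality signals are present, then use the fused signal to modulate normalised backbone features. It is not the paper's gControlNet or ControlNorm. The function names, the modality names, the summation used for fusion, and the scalar weights `w_gamma` and `w_beta` are all hypothetical, and a single feature channel is flattened to a plain list to keep the example dependency-free.

```python
from statistics import fmean, pstdev


def fuse_controls(controls):
    """Fuse the control signals that are present by element-wise summation.

    Summation is an assumed fusion rule chosen for illustration only.
    Missing modalities are passed as None and skipped.
    """
    present = [c for c in controls.values() if c is not None]
    if not present:
        raise ValueError("at least one control signal is required")
    return [sum(values) for values in zip(*present)]


def conditional_norm(features, control, w_gamma, w_beta, eps=1e-5):
    """Normalise one feature channel, then scale and shift it per position
    using the control signal (a generic conditional normalisation)."""
    mu = fmean(features)
    sigma = pstdev(features)
    normed = [(x - mu) / (sigma + eps) for x in features]
    # w_gamma and w_beta stand in for learned projections of the control signal.
    return [n * (1.0 + w_gamma * c) + w_beta * c for n, c in zip(normed, control)]


# Hypothetical modalities; "pose" is absent for this sample.
controls = {
    "sketch": [1, 0, 0, 1],
    "pose": None,
    "segmentation": [0, 1, 0, 1],
}
fused = fuse_controls(controls)  # [1, 1, 0, 2]

backbone = [1.0, 2.0, 3.0, 4.0]
# With zero weights the control has no effect and the output is just the
# normalised backbone features.
unmodulated = conditional_norm(backbone, fused, w_gamma=0.0, w_beta=0.0)
# With a non-zero shift weight, positions where the control is active move.
modulated = conditional_norm(backbone, fused, w_gamma=0.0, w_beta=1.0)
```

In this sketch a position where every control is zero (index 2) is left unchanged by the modulation, which loosely mirrors the abstract's goal of confining a control signal's influence to its designated region.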