Cocktail: Mixing Multi-Modality Controls for Text-Conditional Image Generation
June 1, 2023
Authors: Minghui Hu, Jianbin Zheng, Daqing Liu, Chuanxia Zheng, Chaoyue Wang, Dacheng Tao, Tat-Jen Cham
cs.AI
Abstract
Text-conditional diffusion models can generate high-fidelity images with
diverse contents. However, linguistic representations often describe the
envisioned target imagery only ambiguously, requiring additional control
signals to bolster the efficacy of text-guided diffusion models. In this work,
we propose Cocktail, a pipeline that mixes various modalities into one
embedding, combined with a generalized ControlNet (gControlNet), a controllable
normalisation (ControlNorm), and a spatial guidance sampling method, to achieve
multi-modal and spatially refined control for text-conditional diffusion
models. Specifically, we introduce a hyper-network, gControlNet, dedicated to
aligning and infusing control signals from disparate modalities into the
pre-trained diffusion model. gControlNet accepts flexible modality signals,
including the simultaneous reception of any combination of modality signals or
the supplementary fusion of multiple modality signals. The control signals are
then fused and injected into the backbone model according to our proposed
ControlNorm. Furthermore, our spatial guidance sampling method incorporates the
control signals into the designated regions, preventing undesired objects from
appearing in the generated image. We demonstrate the results of our method in
controlling various modalities, showing high-quality synthesis and fidelity to
multiple external signals.
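
The abstract describes ControlNorm only at a high level. As a rough,
hypothetical illustration of the kind of conditional normalisation it
suggests (not the paper's actual implementation), the fused multi-modal
control embedding could predict channel-wise scale and shift parameters that
modulate the normalized backbone features of the diffusion U-Net. All class
and parameter names below are assumptions made for this sketch.

```python
# Minimal sketch of a ControlNorm-style conditional normalisation layer.
# Illustrative assumption only: the fused multi-modal control embedding
# predicts per-channel scale (gamma) and shift (beta) that modulate the
# normalized backbone activations.
import torch
import torch.nn as nn


class ControlNormSketch(nn.Module):
    def __init__(self, feature_channels: int, control_channels: int):
        super().__init__()
        # Parameter-free normalisation of the backbone feature map
        # (feature_channels is assumed divisible by 32, as in typical U-Nets).
        self.norm = nn.GroupNorm(32, feature_channels, affine=False)
        # Hypothetical 1x1 projections from the control embedding to
        # channel-wise modulation parameters.
        self.to_gamma = nn.Conv2d(control_channels, feature_channels, kernel_size=1)
        self.to_beta = nn.Conv2d(control_channels, feature_channels, kernel_size=1)

    def forward(self, features: torch.Tensor, control: torch.Tensor) -> torch.Tensor:
        # features: (B, C, H, W) backbone activations
        # control:  (B, C_ctrl, H, W) fused multi-modal control signal
        gamma = self.to_gamma(control)
        beta = self.to_beta(control)
        return self.norm(features) * (1 + gamma) + beta
```

In this sketch, a gControlNet-like module would supply the `control` feature
map; how the signals are actually fused and injected is detailed in the full
paper.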