Cocktail: Mixing Multi-Modality Controls for Text-Conditional Image Generation
June 1, 2023
Authors: Minghui Hu, Jianbin Zheng, Daqing Liu, Chuanxia Zheng, Chaoyue Wang, Dacheng Tao, Tat-Jen Cham
cs.AI
Abstract
Text-conditional diffusion models can generate high-fidelity images with
diverse contents. However, linguistic representations often describe the
envisioned target imagery only ambiguously, requiring additional control
signals to bolster the efficacy of text-guided diffusion models. In this work,
we propose Cocktail, a pipeline that mixes various modalities into one
embedding, combined with a generalized ControlNet (gControlNet), a controllable
normalisation (ControlNorm), and a spatial guidance sampling method, to achieve
multi-modal and spatially refined control for text-conditional diffusion
models. Specifically, we introduce a hyper-network, gControlNet, dedicated to
aligning and infusing control signals from disparate modalities into the
pre-trained diffusion model. gControlNet accepts flexible modality signals,
including the simultaneous reception of any combination of modality signals or
the supplementary fusion of multiple modality signals. The control signals are
then fused and injected into the backbone model according to our proposed
ControlNorm. Furthermore, our spatial guidance sampling method incorporates the
control signals into the designated regions, preventing undesired objects from
appearing in the generated image. We demonstrate the results of our method in
controlling various modalities, showing high-quality synthesis and fidelity to
multiple external signals.
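
The abstract describes ControlNorm only at a high level. As a rough,
hypothetical illustration of the kind of conditional normalisation it
suggests (not the paper's actual implementation), the fused multi-modal
control embedding could predict channel-wise scale and shift parameters that
modulate the normalized backbone features of the diffusion U-Net. All class
and parameter names below are assumptions made for this sketch.

```python
# Minimal sketch of a ControlNorm-style conditional normalisation layer.
# Illustrative assumption only: the fused multi-modal control embedding
# predicts per-channel scale (gamma) and shift (beta) that modulate the
# normalized backbone activations.
import torch
import torch.nn as nn


class ControlNormSketch(nn.Module):
    def __init__(self, feature_channels: int, control_channels: int):
        super().__init__()
        # Parameter-free normalisation of the backbone feature map
        # (feature_channels is assumed divisible by 32, as in typical U-Nets).
        self.norm = nn.GroupNorm(32, feature_channels, affine=False)
        # Hypothetical 1x1 projections from the control embedding to
        # channel-wise modulation parameters.
        self.to_gamma = nn.Conv2d(control_channels, feature_channels, kernel_size=1)
        self.to_beta = nn.Conv2d(control_channels, feature_channels, kernel_size=1)

    def forward(self, features: torch.Tensor, control: torch.Tensor) -> torch.Tensor:
        # features: (B, C, H, W) backbone activations
        # control:  (B, C_ctrl, H, W) fused multi-modal control signal
        gamma = self.to_gamma(control)
        beta = self.to_beta(control)
        return self.norm(features) * (1 + gamma) + beta
```

In this sketch, a gControlNet-like module would supply the `control` feature
map; how the signals are actually fused and injected is detailed in the full
paper.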