Cocktail: Mixing Multi-Modality Controls for Text-Conditional Image Generation

Abstract

Text-conditional diffusion models can generate high-fidelity images with diverse content. However, text often describes the intended image ambiguously, so additional control signals are needed to make text-guided diffusion models more effective. In this work, we propose Cocktail, a pipeline that mixes various modalities into one embedding. It combines a generalized ControlNet (gControlNet), a controllable normalization (ControlNorm), and a spatial guidance sampling method to achieve multi-modal, spatially refined control for text-conditional diffusion models. Specifically, we introduce gControlNet, a hyper-network dedicated to aligning control signals from disparate modalities and infusing them into the pre-trained diffusion model. gControlNet accepts flexible modality signals: it can receive any combination of modality signals simultaneously, or additionally fuse multiple modality signals. The control signals are then fused and injected into the backbone model according to our proposed ControlNorm. Furthermore, our spatial guidance sampling method incorporates the control signal into the designated region, which prevents undesired objects from appearing in the generated image. We demonstrate the results of our method in controlling various modalities, showing high-quality synthesis and fidelity to multiple external signals.
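As a rough illustration of the mix-then-inject idea described above, the following minimal Python sketch sums whichever modality signals are present into one vector and then applies a generic conditional normalization (scale-and-shift modulation) to backbone features. This is an assumption made for illustration only, not the paper's gControlNet or ControlNorm: the function names `mix_signals` and `conditional_norm`, the scalar weights `w_gamma` and `w_beta`, and the use of plain 1-D lists are all hypothetical.

```python
from statistics import fmean, pstdev


def mix_signals(signals):
    """Sum the modality vectors that are present (not None) into one vector.

    Hypothetical stand-in for mixing modalities into a single embedding.
    Returns None when no modality is supplied.
    """
    present = [s for s in signals.values() if s is not None]
    if not present:
        return None
    return [sum(vals) for vals in zip(*present)]


def conditional_norm(features, control, w_gamma=0.5, w_beta=0.5, eps=1e-5):
    """Normalize features, then scale and shift them using the control vector.

    Generic conditional normalization, assumed here for illustration:
    out = (1 + gamma) * normalized + beta, with gamma = w_gamma * control
    and beta = w_beta * control. The scalar weights are hypothetical
    placeholders for learned projections.
    """
    if control is None:
        # No control signal: leave the backbone features untouched.
        return list(features)
    mean = fmean(features)
    std = pstdev(features)
    normalized = [(f - mean) / (std + eps) for f in features]
    return [
        (1.0 + w_gamma * c) * n + w_beta * c
        for n, c in zip(normalized, control)
    ]


if __name__ == "__main__":
    backbone = [1.0, 2.0, 3.0, 4.0]
    # Any subset of modalities may be given; missing ones are None.
    controls = {
        "sketch": [0.2, 0.0, 0.0, 0.1],
        "pose": None,
        "segmentation": [0.0, 0.3, 0.0, 0.0],
    }
    mixed = mix_signals(controls)
    print(conditional_norm(backbone, mixed))
```

In this sketch an absent modality simply contributes nothing to the sum, which mirrors the stated goal of accepting any combination of modality signals; how the actual method aligns and weights the signals is not specified in the abstract.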
