Efficient Text-Guided Convolutional Adapter for the Diffusion Model

Abstract

We introduce Nexus Adapters, novel text-guided, efficient adapters for the diffusion-based framework for Structure Preserving Conditional Generation (SPCG). Recently, structure-preserving methods have achieved promising results in conditional image generation by using a base model for prompt conditioning and an adapter for structural inputs such as sketches or depth maps. However, these approaches are highly inefficient and sometimes require as many parameters in the adapter as in the base architecture. Training such a model is not always feasible, since the diffusion model is itself costly and doubling the parameter count is highly inefficient. Moreover, in these approaches the adapter is not aware of the input prompt; it is therefore optimized for the structural input but not for the prompt. To overcome these challenges, we propose two efficient adapters, Nexus Prime and Nexus Slim, which are guided by both prompts and structural inputs. Each Nexus Block incorporates cross-attention mechanisms to enable rich multimodal conditioning. As a result, the proposed adapters understand the input prompt better while preserving the structure. We conducted extensive experiments on the proposed models and demonstrate that the Nexus Prime adapter significantly enhances performance while requiring only 8M more parameters than the baseline, T2I-Adapter. Furthermore, we introduce a lightweight Nexus Slim adapter with 18M fewer parameters than the T2I-Adapter, which still achieves state-of-the-art results. Code: https://github.com/arya-domain/Nexus-Adapters
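
The abstract does not describe the internals of a Nexus Block, so the following is only a rough illustration of the general idea that structural features can attend to the prompt through cross-attention. It is a minimal sketch under stated assumptions: a single attention head, identity query/key/value projections, a residual connection, and toy inputs. The names `cross_attention`, `structure_tokens`, and `text_tokens` are hypothetical and are not taken from the authors' code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(structure_tokens, text_tokens):
    """Single-head scaled dot-product cross-attention with a residual add.

    structure_tokens: feature vectors, one per spatial position (queries).
    text_tokens: prompt-token embeddings (used as both keys and values).
    Assumption: learned Q/K/V projections are omitted (identity) for brevity.
    """
    dim = len(text_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for query in structure_tokens:
        # Similarity of this spatial position to every prompt token.
        scores = [scale * sum(q * k for q, k in zip(query, key))
                  for key in text_tokens]
        weights = softmax(scores)
        # Weighted sum of prompt embeddings for this spatial position.
        context = [sum(w * value[i] for w, value in zip(weights, text_tokens))
                   for i in range(dim)]
        # Residual add keeps the structural signal and mixes in prompt context.
        fused.append([q + c for q, c in zip(query, context)])
    return fused


# Toy inputs: three spatial positions and two prompt tokens, dimension 2.
structure = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
text = [[2.0, 0.0], [0.0, 2.0]]
fused = cross_attention(structure, text)
```

In this toy setup each spatial position draws more strongly on the prompt token it is most similar to, while a position with an all-zero feature vector attends to both tokens equally. A real adapter would add learned projections, multiple heads, and convolutional layers around this step.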
