Efficient Text-Guided Convolutional Adapter for the Diffusion Model

Abstract

We introduce Nexus Adapters, novel and efficient text-guided adapters for diffusion-based Structure-Preserving Conditional Generation (SPCG). Recent structure-preserving methods have achieved promising results in conditional image generation by using a base model for prompt conditioning and an adapter for the structural input, such as a sketch or a depth map. These approaches are highly inefficient: the adapter sometimes requires as many parameters as the base architecture. Training such a model is not always feasible, because the diffusion model is already costly on its own and doubling the parameter count makes it more so. In addition, the adapter in these approaches is not aware of the input prompt, so it is optimized only for the structural input and not for the prompt. To overcome these challenges, we propose two efficient adapters, Nexus Prime and Nexus Slim, both guided by prompts and structural inputs. Each Nexus Block incorporates cross-attention mechanisms to enable rich multimodal conditioning. As a result, the proposed adapters understand the input prompt better while still preserving the structure. We conduct extensive experiments on the proposed models and show that the Nexus Prime adapter significantly improves performance while requiring only 8M additional parameters compared to the baseline, T2I-Adapter. We also introduce a lightweight adapter, Nexus Slim, which has 18M fewer parameters than T2I-Adapter and still achieves state-of-the-art results. Code: https://github.com/arya-domain/Nexus-Adapters
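
To make the idea of prompt-aware conditioning concrete, the following is a minimal sketch of generic single-head scaled dot-product cross-attention, in which structural features act as queries and prompt token embeddings act as keys and values. It is not the authors' Nexus Block: the function names, the toy vectors, the omission of learned projections, and the residual connection are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(structure_feats, text_feats):
    """Single-head scaled dot-product cross-attention (illustrative only).

    structure_feats: query vectors, e.g. flattened sketch or depth features.
    text_feats: key/value vectors, e.g. prompt token embeddings.
    Learned Q/K/V projections are omitted (treated as identity) for brevity.
    """
    dim = len(text_feats[0])
    scale = 1.0 / math.sqrt(dim)
    attended = []
    for query in structure_feats:
        # Similarity of this structural feature to every prompt token.
        scores = [scale * sum(q * k for q, k in zip(query, key))
                  for key in text_feats]
        weights = softmax(scores)
        # Convex combination of the prompt token vectors.
        mixed = [sum(w * value[i] for w, value in zip(weights, text_feats))
                 for i in range(dim)]
        # Assumed residual connection so the structural signal is retained.
        attended.append([q + m for q, m in zip(query, mixed)])
    return attended


# Hypothetical toy inputs: two structural features, three prompt tokens.
structure = [[1.0, 0.0], [0.0, 1.0]]
prompt = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = cross_attention(structure, prompt)
```

Each output vector has the same shape as its structural query, so a block like this could in principle be placed between convolutional stages without changing feature dimensions. How the actual adapters combine attention with convolution is described in the paper and the linked repository, not here.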
