Efficient Text-Guided Convolutional Adapter for the Diffusion Model

Abstract

We introduce Nexus Adapters, novel text-guided, efficient adapters for the diffusion-based framework for Structure Preserving Conditional Generation (SPCG). Recently, structure-preserving methods have achieved promising results in conditional image generation by using a base model for prompt conditioning and an adapter for structural input, such as sketches or depth maps. These approaches are highly inefficient, and the adapter sometimes requires as many parameters as the base architecture. Training such a model is not always feasible, since the diffusion model is itself costly and doubling the parameter count is highly inefficient. In these approaches, the adapter is also unaware of the input prompt, so it is optimized only for the structural input and not for the prompt. To overcome these challenges, we propose two efficient adapters, Nexus Prime and Nexus Slim, which are guided by both prompts and structural inputs. Each Nexus Block incorporates cross-attention mechanisms to enable rich multimodal conditioning, so the proposed adapter better understands the input prompt while preserving the structure. We conduct extensive experiments on the proposed models and demonstrate that the Nexus Prime adapter significantly improves performance while requiring only 8M additional parameters compared to the baseline, T2I-Adapter. Furthermore, we introduce a lightweight Nexus Slim adapter with 18M fewer parameters than the T2I-Adapter, which still achieves state-of-the-art results. Code: https://github.com/arya-domain/Nexus-Adapters
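
The prompt-aware conditioning described in the abstract can be illustrated with a minimal sketch of cross-attention between structural features and prompt tokens. This is not the authors' implementation: the function names (`cross_attention`, `text_guided_block`) are hypothetical, the learned projections and convolution layers of a real adapter block are omitted, and the residual connection is an assumption made for illustration. It only shows how each structural feature vector can attend to the prompt tokens.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention.

    queries: structural feature vectors, one per spatial position.
    keys, values: prompt-token vectors (here the same list is used for both).
    """
    dim = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this spatial position to every prompt token.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        weights = softmax(scores)
        # Weighted average of the prompt-token values.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


def text_guided_block(structure_feats, prompt_tokens):
    """Hypothetical block: structural features attend to the prompt,
    and the result is added back through an assumed residual connection."""
    attended = cross_attention(structure_feats, prompt_tokens, prompt_tokens)
    return [
        [s + a for s, a in zip(feat, att)]
        for feat, att in zip(structure_feats, attended)
    ]


if __name__ == "__main__":
    feats = [[1.0, 0.0], [0.0, 1.0]]   # two spatial positions
    prompt = [[0.5, 0.5], [1.0, -1.0]]  # two prompt tokens
    print(text_guided_block(feats, prompt))
```

In this sketch the structural features act as queries and the prompt tokens as keys and values, so the output keeps one vector per spatial position (preserving the structural layout) while mixing in prompt information.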
