DreamRenderer: Taming Multi-Instance Attribute Control in Large-Scale Text-to-Image Models

Abstract

Image-conditioned generation methods, such as depth- and canny-conditioned approaches, have demonstrated remarkable abilities for precise image synthesis. However, existing models still struggle to accurately control the content of multiple instances (or regions). Even state-of-the-art models like FLUX and 3DIS face challenges, such as attribute leakage between instances, which limits user control. To address these issues, we introduce DreamRenderer, a training-free approach built upon the FLUX model. DreamRenderer enables users to control the content of each instance via bounding boxes or masks, while ensuring overall visual harmony. We propose two key innovations: 1) Bridge Image Tokens for Hard Text Attribute Binding, which uses replicated image tokens as bridge tokens to ensure that T5 text embeddings, pre-trained solely on text data, bind the correct visual attributes for each instance during Joint Attention; 2) Hard Image Attribute Binding applied only to vital layers. Through our analysis of FLUX, we identify the critical layers responsible for instance attribute rendering and apply Hard Image Attribute Binding only in these layers, using soft binding in the others. This approach ensures precise control while preserving image quality. Evaluations on the COCO-POS and COCO-MIG benchmarks demonstrate that DreamRenderer improves the Image Success Ratio by 17.7% over FLUX and enhances the performance of layout-to-image models like GLIGEN and 3DIS by up to 26.8%. Project page: https://limuloo.github.io/DreamRenderer/
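The distinction between hard and soft binding, and the idea of restricting hard image binding to a subset of layers, can be illustrated with attention masks. The following is a minimal NumPy sketch under stated assumptions, not the DreamRenderer implementation: the token layout (all instance text tokens first, then image tokens), the function names `binding_mask` and `layer_masks`, the treatment of background tokens, and the layer indices are all hypothetical, and the replication step behind Bridge Image Tokens is not modeled.

```python
import numpy as np

def binding_mask(region_masks, text_lens, hard_image):
    """Boolean attention mask (True = query row may attend to key column).

    Assumed token order: [text of instance 0, text of instance 1, ..., image tokens].
    region_masks: (K, N) bool array; region_masks[k, j] is True when image token j
                  lies inside instance k's bounding box or mask.
    text_lens:    K integers, the number of text tokens describing each instance.
    hard_image:   if True, image tokens inside an instance attend only to image
                  tokens of the same instance; if False, to all image tokens.
    """
    region_masks = np.asarray(region_masks, dtype=bool)
    num_image = region_masks.shape[1]
    num_text = int(sum(text_lens))
    allow = np.zeros((num_text + num_image, num_text + num_image), dtype=bool)

    # Image-to-image block: unrestricted by default ("soft" in this sketch).
    img = allow[num_text:, num_text:]  # a view, so writes below update `allow`
    img[:] = True
    if hard_image:
        counts = region_masks.astype(int)
        same_instance = (counts.T @ counts) > 0   # (N, N): tokens share an instance
        covered = region_masks.any(axis=0)        # tokens inside at least one instance
        # Assumption: background tokens (covered == False) stay unrestricted.
        img[covered] = same_instance[covered]

    # Text <-> image binding: each instance's text sees only its own text and region.
    start = 0
    for k, length in enumerate(text_lens):
        txt = slice(start, start + length)
        img_idx = num_text + np.flatnonzero(region_masks[k])
        allow[txt, txt] = True
        allow[txt, img_idx] = True
        allow[img_idx, txt] = True
        start += length
    return allow

def layer_masks(region_masks, text_lens, num_layers, vital_layers):
    """One mask per layer: hard image binding only in `vital_layers`."""
    hard = binding_mask(region_masks, text_lens, hard_image=True)
    soft = binding_mask(region_masks, text_lens, hard_image=False)
    return [hard if i in vital_layers else soft for i in range(num_layers)]

# Usage: two instances over four image tokens; layer indices are illustrative only.
regions = np.array([[1, 1, 0, 0],
                    [0, 0, 1, 1]], dtype=bool)
masks = layer_masks(regions, text_lens=[2, 1], num_layers=4, vital_layers={1, 2})
```

In this sketch the text-to-region restriction is present in every layer, while the image-to-image restriction is switched on only for the chosen layers, which mirrors the trade-off the abstract describes between precise control and image quality.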
