Injection of image guidance into text-conditioned diffusion models during inference

Abstract

Text-to-image diffusion models such as Stable Diffusion generate high-quality images from text, but lack a way to inject visual guidance (e.g., sketches, styles) at inference time without retraining. Existing methods either require computationally expensive fine-tuning or rely on style transfer techniques that risk semantic misalignment with the text prompt. We introduce Visual Concept Fusion (VCF), the first method offering dual conditioning on both an image and a text prompt at inference time without any concept-specific training. VCF enables visual concept injection into Stable Diffusion by aligning CLIP image features with the text embedding space. VCF consists of three components: (1) a lightweight aligner that maps image tokens onto the text embedding manifold using InfoNCE and cross-attention reconstruction losses, (2) a fusion strategy that preserves both textual and visual semantics, and (3) an optional Prompt-Noise Optimization (PNO) module for test-time refinement. Our experiments demonstrate that VCF successfully transfers visual attributes, including style, composition, and color palette, from reference images while maintaining prompt adherence. Quantitative results show a trade-off between text alignment (CLIP score) and visual correspondence (LPIPS), with VCF outperforming the baselines in reference fidelity.
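
To make the InfoNCE term of the aligner objective concrete, the following is a minimal, dependency-free sketch of a one-directional (image-to-text) InfoNCE loss over a batch of paired embeddings. It illustrates the general technique only and is not the VCF implementation: the function names `l2_normalize` and `info_nce`, the use of cosine similarity, and the temperature value 0.07 are assumptions made for this example, and the cross-attention reconstruction loss is not shown.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(image_embs, text_embs, temperature=0.07):
    """Mean image-to-text InfoNCE loss for a batch of paired embeddings.

    image_embs[i] and text_embs[i] form the positive pair; every other
    text embedding in the batch serves as a negative for image i.
    The temperature is an assumed default, not a value from the source.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    total = 0.0
    for i, img in enumerate(imgs):
        # Cosine similarity of image i to every text, scaled by temperature.
        logits = [sum(a * b for a, b in zip(img, txt)) / temperature
                  for txt in txts]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of the matching text.
        total += log_denom - logits[i]
    return total / len(imgs)


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    matched_texts = [[1.0, 0.0], [0.0, 1.0]]
    swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
    print("aligned pairs:", info_nce(images, matched_texts))
    print("swapped pairs:", info_nce(images, swapped_texts))
```

In this toy setup the loss is close to zero when each image embedding points in the same direction as its paired text embedding, and large when the pairs are swapped. That is the pressure that would pull aligned image tokens toward their matching region of the text embedding space.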