Injecting Image Guidance into Text-Conditioned Diffusion Models at Inference Time

Abstract

Text-to-image diffusion models such as Stable Diffusion generate high-quality images from text, but they offer no way to inject visual guidance (e.g., sketches or styles) at inference time without retraining. Existing methods either require computationally expensive fine-tuning or rely on style transfer techniques that risk semantic misalignment with the text prompt. We introduce Visual Concept Fusion (VCF), the first method to condition jointly on an image and a text prompt at inference time without any concept-specific training. VCF injects visual concepts into Stable Diffusion by aligning CLIP image features with the text embedding space. It consists of three components: (1) a lightweight aligner that maps image tokens to the text embedding manifold using InfoNCE and cross-attention reconstruction losses, (2) a fusion strategy that preserves both textual and visual semantics, and (3) an optional Prompt-Noise Optimization (PNO) module for test-time refinement. Our experiments demonstrate that VCF transfers visual attributes, including style, composition, and color palette, from reference images while maintaining prompt adherence. Quantitative results show a trade-off between text alignment (CLIP score) and visual correspondence (LPIPS), with VCF outperforming baselines in reference fidelity.
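
As a rough illustration of the InfoNCE term mentioned for the aligner, the following sketch computes a standard image-to-text InfoNCE loss over a batch of paired embeddings. It is not the VCF implementation: the function names, the use of cosine similarity, and the temperature of 0.07 are assumptions made for this example, and the cross-attention reconstruction loss is not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(image_embs, text_embs, temperature=0.07):
    """Mean InfoNCE loss in the image -> text direction.

    image_embs[i] and text_embs[i] are treated as the positive pair;
    every other text embedding in the batch acts as a negative.
    The temperature value is an assumption, not taken from the paper.
    """
    losses = []
    for i, img in enumerate(image_embs):
        # Similarity of this (aligned) image embedding to every text embedding.
        logits = [cosine(img, txt) / temperature for txt in text_embs]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching text as the target class.
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


# Toy batch: two hypothetical aligned image embeddings and their text embeddings.
aligned = [[1.0, 0.0], [0.0, 1.0]]
texts = [[1.0, 0.0], [0.0, 1.0]]

loss_matched = info_nce(aligned, texts)        # correct pairing
loss_swapped = info_nce(aligned, texts[::-1])  # deliberately mismatched pairing
```

In this toy batch the correctly paired embeddings give a loss close to zero, while the swapped pairing gives a much larger one. That gap is what pushes an aligner of this kind to place each image embedding near its own text embedding and away from the others.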