Injecting Image Guidance into Text-Conditioned Diffusion Models at Inference Time

Abstract

Text-to-image diffusion models such as Stable Diffusion generate high-quality images from text, but they offer no way to inject visual guidance (e.g., sketches or styles) at inference time without retraining. Existing methods either require computationally expensive fine-tuning or rely on style-transfer techniques that risk semantic misalignment with the text prompt. We introduce Visual Concept Fusion (VCF), the first method to condition jointly on an image and a text prompt at inference time without any concept-specific training. VCF injects visual concepts into Stable Diffusion by aligning CLIP image features with the text embedding space. It consists of three components: (1) a lightweight aligner that maps image tokens onto the text embedding manifold, trained with InfoNCE and cross-attention reconstruction losses; (2) a fusion strategy that preserves both textual and visual semantics; and (3) an optional Prompt-Noise Optimization (PNO) module for test-time refinement. Our experiments show that VCF transfers visual attributes such as style, composition, and color palette from reference images while maintaining prompt adherence. Quantitative results reveal a trade-off between text alignment (CLIP score) and visual correspondence (LPIPS), with VCF outperforming baselines in reference fidelity.
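
The abstract names InfoNCE as one of the aligner's training losses but does not give its exact formulation. As a rough illustration only, the sketch below shows the standard one-directional InfoNCE objective over a batch of paired vectors: each aligned image vector should score highest against its own text embedding and lower against the other texts in the batch. The function names, the temperature value of 0.07, and the use of plain Python lists instead of a tensor library are assumptions made for this example, not details taken from VCF.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(image_vecs, text_vecs, temperature=0.07):
    """Mean one-directional InfoNCE loss over a batch of paired vectors.

    Pair i (image_vecs[i], text_vecs[i]) is the positive; every other text
    vector in the batch acts as a negative. The temperature is an assumed
    illustrative value, not one reported for VCF.
    """
    imgs = [l2_normalize(v) for v in image_vecs]
    txts = [l2_normalize(v) for v in text_vecs]
    total = 0.0
    for i, img in enumerate(imgs):
        # Temperature-scaled cosine similarity to every text in the batch.
        logits = [sum(a * b for a, b in zip(img, t)) / temperature for t in txts]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Negative log-probability of the matching text.
        total += log_denominator - logits[i]
    return total / len(imgs)


# Toy batch: two "aligned image" vectors and their paired text embeddings.
aligned_images = [[1.0, 0.0], [0.0, 1.0]]
text_embeddings = [[1.0, 0.0], [0.0, 1.0]]

matched_loss = info_nce(aligned_images, text_embeddings)
swapped_loss = info_nce(aligned_images, text_embeddings[::-1])
```

With correctly paired vectors the loss is close to zero, and it grows sharply when the pairs are swapped. That gap is the signal that would push an aligner's outputs toward the matching region of the text embedding space.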