Injecting Image Guidance into Text-Conditioned Diffusion Models at Inference Time

Abstract

Text-to-image diffusion models such as Stable Diffusion generate high-quality images from text, but they lack a way to inject visual guidance (e.g., sketches or styles) at inference time without retraining. Existing methods either require computationally expensive fine-tuning or rely on style-transfer techniques that risk semantic misalignment with the text prompt. We introduce Visual Concept Fusion (VCF), the first method to offer dual conditioning on both an image and a text prompt at inference time without any concept-specific training. VCF injects visual concepts into Stable Diffusion by aligning CLIP image features with the text embedding space. It consists of three components: (1) a lightweight aligner that maps image tokens onto the text embedding manifold using an InfoNCE loss and a cross-attention reconstruction loss, (2) a fusion strategy that preserves both textual and visual semantics, and (3) an optional Prompt-Noise Optimization (PNO) module for test-time refinement. Our experiments demonstrate that VCF transfers visual attributes, including style, composition, and color palette, from reference images while maintaining prompt adherence. Quantitative results show a trade-off between text alignment (CLIP score) and visual correspondence (LPIPS), with VCF outperforming baseline methods in reference fidelity.
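
To make the aligner's contrastive objective concrete, the following is a minimal NumPy sketch of a generic InfoNCE loss between a batch of image embeddings and their paired text embeddings. It is an illustration under stated assumptions, not the VCF implementation: the function name `info_nce`, the temperature of 0.07, and the image-to-text-only direction are all hypothetical choices, and the aligner network and the cross-attention reconstruction loss are not shown.

```python
import numpy as np


def info_nce(image_emb, text_emb, temperature=0.07):
    """Generic image-to-text InfoNCE loss (illustrative, not the VCF code).

    image_emb, text_emb: arrays of shape (N, D); row i of each is a matched pair.
    temperature: assumed value, not taken from the paper.
    """
    # L2-normalize so the dot product below is a cosine similarity.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Pairwise similarities: entry (i, j) compares image i with text j.
    logits = img @ txt.T / temperature

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # The positive for image i is text i, i.e. the diagonal entry.
    return float(-np.mean(np.diag(log_probs)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    text = rng.normal(size=(8, 16))
    # Image embeddings that sit close to their paired text embeddings.
    aligned = text + 0.01 * rng.normal(size=(8, 16))
    print("aligned pairs:   ", info_nce(aligned, text))
    print("mismatched pairs:", info_nce(aligned[::-1], text))
```

In this sketch the loss is small when each image embedding is closest to its own text embedding and grows when the pairing is broken, which is the property an aligner trained with such a loss would exploit to pull image tokens toward the text embedding space.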