Color Me Correctly: Bridging Perceptual Color Spaces and Text Embeddings for Improved Diffusion Generation
September 12, 2025
Authors: Sung-Lin Tsai, Bo-Lun Huang, Yu Ting Shen, Cheng Yu Yeo, Chiang Tseng, Bo-Kai Ruan, Wen-Sheng Lien, Hong-Han Shuai
cs.AI
Abstract
Accurate color alignment in text-to-image (T2I) generation is critical for
applications such as fashion, product visualization, and interior design, yet
current diffusion models struggle with nuanced and compound color terms (e.g.,
Tiffany blue, lime green, hot pink), often producing images that are misaligned
with human intent. Existing approaches rely on cross-attention manipulation,
reference images, or fine-tuning but fail to systematically resolve ambiguous
color descriptions. To precisely render colors under prompt ambiguity, we
propose a training-free framework that enhances color fidelity by leveraging a
large language model (LLM) to disambiguate color-related prompts and guiding
color blending operations directly in the text embedding space. Our method
first employs the LLM to resolve ambiguous color terms in
the text prompt, and then refines the text embeddings based on the spatial
relationships of the resulting color terms in the CIELAB color space. Unlike
prior methods, our approach improves color accuracy without requiring
additional training or external reference images. Experimental results
demonstrate that our framework improves color alignment without compromising
image quality, bridging the gap between text semantics and visual generation.
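
The abstract does not include code, so the following is only a minimal sketch of the second stage it describes, under stated assumptions rather than the authors' implementation. It assumes a hypothetical LLM step has already resolved a compound term such as "Tiffany blue" to a concrete sRGB triple, and it derives CIELAB-distance-based weights over a few basic color words whose text embeddings could then be blended; the anchor set and the inverse-distance weighting are illustrative choices, not taken from the paper.

```python
# Sketch: map a resolved sRGB color to blending weights over anchor color
# words via distances in CIELAB. The anchor palette and weighting scheme
# are assumptions for illustration, not the paper's method.
import numpy as np

def srgb_to_lab(rgb):
    """Convert an sRGB triple in [0, 1] to CIELAB (D65 white point)."""
    rgb = np.asarray(rgb, dtype=np.float64)
    # Linearize the sRGB gamma curve.
    lin = np.where(rgb > 0.04045, ((rgb + 0.055) / 1.055) ** 2.4, rgb / 12.92)
    # Linear sRGB -> XYZ (D65).
    m = np.array([[0.4124564, 0.3575761, 0.1804375],
                  [0.2126729, 0.7151522, 0.0721750],
                  [0.0193339, 0.1191920, 0.9503041]])
    xyz = m @ lin
    # Normalize by the D65 reference white, then apply the CIELAB nonlinearity.
    xyz /= np.array([0.95047, 1.0, 1.08883])
    f = np.where(xyz > (6 / 29) ** 3,
                 np.cbrt(xyz),
                 xyz / (3 * (6 / 29) ** 2) + 4 / 29)
    return np.array([116 * f[1] - 16, 500 * (f[0] - f[1]), 200 * (f[1] - f[2])])

# Hypothetical anchors: single color words a text encoder already renders well.
ANCHORS = {"blue": (0.0, 0.0, 1.0), "green": (0.0, 0.5, 0.0), "white": (1.0, 1.0, 1.0)}

def blending_weights(target_rgb, anchors=ANCHORS):
    """Weights over anchor colors, inversely proportional to CIELAB distance."""
    target = srgb_to_lab(target_rgb)
    dists = {name: np.linalg.norm(target - srgb_to_lab(rgb)) + 1e-6
             for name, rgb in anchors.items()}
    inv = {name: 1.0 / d for name, d in dists.items()}
    total = sum(inv.values())
    return {name: w / total for name, w in inv.items()}

# An LLM might resolve "Tiffany blue" to roughly sRGB (0.04, 0.73, 0.71);
# the resulting weights could then mix the prompt embeddings of the anchors.
print(blending_weights((0.04, 0.73, 0.71)))
```

The weighted combination of anchor-word embeddings would replace the ambiguous color token's embedding before denoising; the sketch stops at the weights because the embedding-space operation depends on the specific text encoder.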