The Cow of Rembrandt - Analyzing Artistic Prompt Interpretation in Text-to-Image Models
July 31, 2025
Authors: Alfio Ferrara, Sergio Picascia, Elisabetta Rocchetti
cs.AI
Abstract
Text-to-image diffusion models have demonstrated remarkable capabilities in
generating artistic content by learning from billions of images, including
popular artworks. However, the fundamental question of how these models
internally represent concepts, such as content and style in paintings, remains
unexplored. Traditional computer vision assumes content and style are
orthogonal, but diffusion models receive no explicit guidance about this
distinction during training. In this work, we investigate how transformer-based
text-to-image diffusion models encode content and style concepts when
generating artworks. We leverage cross-attention heatmaps to attribute pixels
in generated images to specific prompt tokens, enabling us to isolate image
regions influenced by content-describing versus style-describing tokens. Our
findings reveal that diffusion models demonstrate varying degrees of
content-style separation depending on the specific artistic prompt and style
requested. In many cases, content tokens primarily influence object-related
regions while style tokens affect background and texture areas, suggesting an
emergent understanding of the content-style distinction. These insights
contribute to our understanding of how large-scale generative models internally
represent complex artistic concepts without explicit supervision. We share the
code and dataset, together with an exploratory tool for visualizing attention
maps at https://github.com/umilISLab/artistic-prompt-interpretation.
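
The attribution step described above can be illustrated with a minimal sketch. It assumes that one cross-attention heatmap per prompt token has already been extracted from the model and resized to a common grid; the function names, the min-max normalization, the 0.5 threshold, and the use of intersection-over-union as an overlap score are all illustrative assumptions, not the authors' implementation (see the linked repository for that). A low overlap between the content mask and the style mask would be one possible indication of content-style separation.

```python
from typing import Dict, List, Sequence

Heatmap = List[List[float]]  # H x W grid of attention weights for one token
Mask = List[List[bool]]


def average_heatmaps(heatmaps: Sequence[Heatmap]) -> Heatmap:
    """Average several same-sized per-token heatmaps into one map."""
    if not heatmaps:
        raise ValueError("need at least one heatmap")
    rows, cols = len(heatmaps[0]), len(heatmaps[0][0])
    return [
        [sum(h[r][c] for h in heatmaps) / len(heatmaps) for c in range(cols)]
        for r in range(rows)
    ]


def binarize(heatmap: Heatmap, threshold: float = 0.5) -> Mask:
    """Min-max normalize to [0, 1], then keep pixels at or above threshold."""
    flat = [v for row in heatmap for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        # A constant map carries no spatial information: return an empty mask.
        return [[False for _ in row] for row in heatmap]
    return [[(v - lo) / span >= threshold for v in row] for row in heatmap]


def iou(a: Mask, b: Mask) -> float:
    """Intersection-over-union of two binary masks (0.0 if both are empty)."""
    pairs = [(x, y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    inter = sum(1 for x, y in pairs if x and y)
    union = sum(1 for x, y in pairs if x or y)
    return inter / union if union else 0.0


def content_style_overlap(
    token_maps: Dict[str, Heatmap],
    content_tokens: Sequence[str],
    style_tokens: Sequence[str],
    threshold: float = 0.5,
) -> float:
    """Overlap between regions attributed to content and to style tokens."""
    content = binarize(
        average_heatmaps([token_maps[t] for t in content_tokens]), threshold
    )
    style = binarize(
        average_heatmaps([token_maps[t] for t in style_tokens]), threshold
    )
    return iou(content, style)


# Toy 2x2 heatmaps (made-up values): "cow" attends to the top row,
# "Rembrandt" to the bottom row, so the two masks do not overlap.
toy_maps = {
    "cow": [[0.9, 0.8], [0.1, 0.0]],
    "Rembrandt": [[0.1, 0.0], [0.7, 0.9]],
}
overlap = content_style_overlap(toy_maps, ["cow"], ["Rembrandt"])
print(overlap)  # 0.0 for these toy maps
```

An overlap near 0 means the content and style tokens were attributed to disjoint image regions, while a value near 1 means they were attributed to the same pixels; the threshold controls how much of each heatmap counts as "influenced".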