
The Cow of Rembrandt - Analyzing Artistic Prompt Interpretation in Text-to-Image Models

July 31, 2025
Authors: Alfio Ferrara, Sergio Picascia, Elisabetta Rocchetti
cs.AI

Abstract

Text-to-image diffusion models have demonstrated remarkable capabilities in generating artistic content by learning from billions of images, including popular artworks. However, the fundamental question of how these models internally represent concepts, such as content and style in paintings, remains unexplored. Traditional computer vision assumes content and style are orthogonal, but diffusion models receive no explicit guidance about this distinction during training. In this work, we investigate how transformer-based text-to-image diffusion models encode content and style concepts when generating artworks. We leverage cross-attention heatmaps to attribute pixels in generated images to specific prompt tokens, enabling us to isolate image regions influenced by content-describing versus style-describing tokens. Our findings reveal that diffusion models demonstrate varying degrees of content-style separation depending on the specific artistic prompt and style requested. In many cases, content tokens primarily influence object-related regions while style tokens affect background and texture areas, suggesting an emergent understanding of the content-style distinction. These insights contribute to our understanding of how large-scale generative models internally represent complex artistic concepts without explicit supervision. We share the code and dataset, together with an exploratory tool for visualizing attention maps at https://github.com/umilISLab/artistic-prompt-interpretation.
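The core attribution idea described above, aggregating cross-attention weights into per-token pixel heatmaps and then measuring how much the content-token and style-token regions overlap, can be sketched in a few lines. The sketch below is a hypothetical illustration, not the authors' exact pipeline: the attention matrix is random toy data, and `token_heatmaps`, `separation_score`, the token indices, and the 0.5 threshold are all assumptions made for demonstration.

```python
import numpy as np

def token_heatmaps(attn, token_ids):
    """Sum cross-attention columns for the given prompt tokens into one
    per-pixel map (attn has shape [num_pixels, num_tokens]), then
    normalize to [0, 1]."""
    m = attn[:, token_ids].sum(axis=1)
    return m / (m.max() + 1e-8)

def separation_score(content_map, style_map, thresh=0.5):
    """1.0 means content and style tokens attend to disjoint regions;
    0.0 means they attend to exactly the same pixels (1 - IoU of the
    thresholded maps)."""
    c = content_map >= thresh
    s = style_map >= thresh
    inter = np.logical_and(c, s).sum()
    union = np.logical_or(c, s).sum()
    return 1.0 - inter / max(union, 1)

# Toy example: a 64x64 latent grid attended over 6 prompt tokens.
rng = np.random.default_rng(0)
attn = rng.random((64 * 64, 6))
content = token_heatmaps(attn, [1, 2])  # e.g. tokens describing the subject
style = token_heatmaps(attn, [4, 5])    # e.g. tokens naming the style
print(round(separation_score(content, style), 3))
```

In a real setting the `attn` matrix would be harvested from the cross-attention layers of a diffusion model's denoising network (averaged over heads and timesteps), but the separation measure itself only needs the resulting pixel-by-token matrix.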
PDF · August 7, 2025