文字轉圖像模型對文字編碼器的依賴比你想像的少
Text-to-Image Models Need Less from Text Encoders Than You Think
June 2, 2026
作者: Nurit Spingarn, Noa Cohen, Tamar Rott Shaham, Tomer Michaeli
cs.AI
摘要
文本到图像模型依赖文本提示作为人类意图的主要交互界面。提示通过文本编码器编码为嵌入,从而条件化图像生成过程。除了单个词元的含义外,文本嵌入还编码了整个提示的上下文信息,例如组合性和属性绑定。然而,图像模型是否实际利用了这些更丰富的信息仍是一个未被充分探究的问题。本文旨在探讨:文本表示的哪些方面对图像生成至关重要?我们证明,基于扩散变换器的文本到图像模型通常只依赖文本表示的两个相对简单的方面:(i) 将相邻词元合并为单词表示(针对跨多个词元的单词),以及 (ii) 词序,它由文本编码器的位置嵌入所印刻。为了证明这一点,我们构建了一种新的文本嵌入,它仅编码单个单词的含义和顺序,但缺乏关于整个提示的任何上下文信息。我们发现,这种带有位置标记的词袋表示足以成功引导图像生成,其视觉质量和文本忠实度与使用完整文本嵌入引导的生成相当。这表明,与普遍认知相反,文本到图像模型通常并不利用文本嵌入中除单词含义和词序之外的丰富信息。相反,复杂语言结构的解码是由图像模型自身完成的。项目网页:https://nsping13.github.io/contextless-TTI/
English
Text-to-image models rely on text prompts as their primary interface to human intent. Prompts are encoded by a text encoder into embeddings that condition the image generation process. Beyond individual token meanings, text embeddings encode contextual information across the full prompt, such as compositionality and attribute binding. However, whether image models actually exploit this richer information remains underexplored. Here, we address the question: Which aspects of text representation are essential for image generation? We show that text-to-image diffusion transformer-based models commonly rely only on two relatively straightforward aspects of text representations: (i) the merging of adjacent tokens into a word representation, for words spanning multiple tokens, and (ii) word order, which is imprinted by the positional embedding of the text-encoder. To show this, we construct a new text embedding that encodes only individual word meanings and order but lacks any contextual information about the full prompt. We find that this bag of position-tagged words representation is sufficient to successfully guide image generation, achieving visual quality and text fidelity that are on par with full text embedding-guided generation. This demonstrates that, contrary to common belief, text-to-image models often do not use the rich information encoded in the text embedding beyond individual word meanings and word order. Instead, the decoding of complex linguistic structures is performed by the image model itself. Project webpage: https://nsping13.github.io/contextless-TTI/