Text-to-Image Models Need Less from Text Encoders Than You Think

June 2, 2026
Authors: Nurit Spingarn, Noa Cohen, Tamar Rott Shaham, Tomer Michaeli
cs.AI

Abstract

Text-to-image models rely on text prompts as their primary interface to human intent. A text encoder maps each prompt to embeddings that condition the image generation process. Beyond individual token meanings, these embeddings encode contextual information across the full prompt, such as compositionality and attribute binding. However, whether image models actually exploit this richer information remains underexplored. Here, we ask: which aspects of the text representation are essential for image generation? We show that text-to-image models based on diffusion transformers commonly rely on only two relatively straightforward aspects of text representations: (i) the merging of adjacent tokens into a single word representation, for words spanning multiple tokens, and (ii) word order, which is imprinted by the positional embedding of the text encoder. To show this, we construct a new text embedding that encodes only individual word meanings and word order but lacks any contextual information about the full prompt. We find that this bag-of-position-tagged-words representation is sufficient to guide image generation successfully, achieving visual quality and text fidelity on par with generation guided by the full text embedding. This demonstrates that, contrary to common belief, text-to-image models often do not use the rich information encoded in the text embedding beyond individual word meanings and word order. Instead, the decoding of complex linguistic structures is performed by the image model itself. Project webpage: https://nsping13.github.io/contextless-TTI/
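
The idea of a bag of position-tagged words can be illustrated with a toy sketch. This is not the authors' construction: the abstract does not specify how tokens are merged or how positions are tagged, so every name below (`toy_token_vector`, `toy_tokenize`, `positional_tag`, `contextless_embedding`), the hash-based stand-in for a text encoder, the mean pooling of sub-word pieces, and the sinusoidal position vector are assumptions chosen only to make the two ingredients concrete: one vector per word, plus a marker of where that word sits in the prompt.

```python
import hashlib
import numpy as np

DIM = 8  # toy embedding width (assumption; real encoders are far wider)


def toy_token_vector(token):
    """Deterministic pseudo-embedding for one token.

    Hypothetical stand-in for a real text encoder's token embedding.
    """
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:4], "little")
    return np.random.default_rng(seed).standard_normal(DIM)


def toy_tokenize(word, piece_len=4):
    """Split a word into fixed-length pieces to mimic sub-word tokenization."""
    return [word[i:i + piece_len] for i in range(0, len(word), piece_len)]


def positional_tag(position):
    """Sinusoidal position vector (one of many possible position encodings)."""
    freqs = 1.0 / (10000 ** (np.arange(0, DIM, 2) / DIM))
    angles = position * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])


def contextless_embedding(prompt):
    """Return one row per word: pooled word vector plus a position tag.

    Each row depends only on the word itself and its index in the prompt,
    never on the other words, so no cross-word context is encoded.
    """
    rows = []
    for position, word in enumerate(prompt.split()):
        pieces = toy_tokenize(word)
        # (i) merge the pieces of a multi-token word into a single vector
        word_vec = np.mean([toy_token_vector(p) for p in pieces], axis=0)
        # (ii) tag the word vector with its position in the prompt
        rows.append(word_vec + positional_tag(position))
    return np.stack(rows)


if __name__ == "__main__":
    a = contextless_embedding("red cube on a blue sphere")
    b = contextless_embedding("green cube")
    # "cube" sits at index 1 in both prompts, so its row is identical.
    print(np.allclose(a[1], b[1]))
```

In this sketch, swapping the surrounding words leaves a word's row unchanged, while moving the word to a different index changes it. That is the sense in which such a representation carries word meaning and word order but nothing about the rest of the prompt.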