テキスト・画像生成モデルがテキストエンコーダに求めるものは、想像以上に少ない

要旨

テキストから画像へのモデルは、人間の意図を伝える主要なインターフェースとしてテキストプロンプトに依存している。プロンプトはテキストエンコーダによって埋め込みに符号化され、画像生成プロセスを条件付ける。個々のトークンの意味を超えて、テキスト埋め込みはプロンプト全体の文脈情報（構成性や属性結合など）を符号化する。しかし、画像モデルが実際にこの豊かな情報を活用しているかどうかは十分に調査されていない。本稿では、「画像生成にとってテキスト表現のどの側面が本質的か？」という問いに取り組む。我々は、テキストから画像への拡散トランスフォーマーベースモデルが、一般にテキスト表現の比較的単純な二つの側面のみに依存していることを示す：(i) 複数のトークンにまたがる単語について、隣接するトークンを単語表現に統合すること、(ii) テキストエンコーダの位置埋め込みによって刻印される語順である。これを示すために、個々の単語の意味と順序のみを符号化し、プロンプト全体に関する文脈情報を欠いた新しいテキスト埋め込みを構築した。この位置タグ付き単語のバッグ表現が画像生成を成功裡に導くのに十分であり、視覚品質とテキスト忠実度において完全なテキスト埋め込みによる生成と同等の結果を達成することを発見した。これは、一般的な認識に反して、テキストから画像へのモデルは多くの場合、個々の単語の意味と語順を超えたテキスト埋め込みに符号化された豊かな情報を使用しておらず、代わりに複雑な言語構造の解読は画像モデル自身によって行われていることを示している。プロジェクトウェブページ: https://nsping13.github.io/contextless-TTI/

English

Text-to-image models rely on text prompts as their primary interface to human intent. Prompts are encoded by a text encoder into embeddings that condition the image generation process. Beyond individual token meanings, text embeddings encode contextual information across the full prompt, such as compositionality and attribute binding. However, whether image models actually exploit this richer information remains underexplored. Here, we address the question: Which aspects of text representation are essential for image generation? We show that text-to-image diffusion transformer-based models commonly rely only on two relatively straightforward aspects of text representations: (i) the merging of adjacent tokens into a word representation, for words spanning multiple tokens, and (ii) word order, which is imprinted by the positional embedding of the text-encoder. To show this, we construct a new text embedding that encodes only individual word meanings and order but lacks any contextual information about the full prompt. We find that this bag of position-tagged words representation is sufficient to successfully guide image generation, achieving visual quality and text fidelity that are on par with full text embedding-guided generation. This demonstrates that, contrary to common belief, text-to-image models often do not use the rich information encoded in the text embedding beyond individual word meanings and word order. Instead, the decoding of complex linguistic structures is performed by the image model itself. Project webpage: https://nsping13.github.io/contextless-TTI/