텍스트-이미지 모델은 텍스트 인코더로부터 생각보다 덜 필요로 한다

초록

텍스트-이미지 모델은 인간의 의도를 전달하는 주요 인터페이스로 텍스트 프롬프트를 사용한다. 프롬프트는 텍스트 인코더에 의해 임베딩으로 인코딩되며, 이 임베딩은 이미지 생성 과정을 조건화한다. 개별 토큰 의미를 넘어, 텍스트 임베딩은 구성성 및 속성 결합과 같은 전체 프롬프트에 걸친 맥락 정보를 인코딩한다. 그러나 이미지 모델이 실제로 이와 같은 풍부한 정보를 활용하는지는 아직 충분히 탐구되지 않았다. 본 연구에서는 '텍스트 표현의 어떤 측면이 이미지 생성에 필수적인가?'라는 질문을 다룬다. 우리는 텍스트-이미지 확산 변환기 기반 모델이 일반적으로 텍스트 표현의 비교적 단순한 두 가지 측면에만 의존함을 보여준다: (i) 여러 토큰에 걸친 단어의 경우 인접 토큰을 단어 표현으로 병합하는 것, (ii) 텍스트 인코더의 위치 임베딩에 의해 각인된 단어 순서이다. 이를 입증하기 위해, 우리는 개별 단어 의미와 순서만을 인코딩하고 전체 프롬프트에 대한 맥락 정보는 전혀 포함하지 않는 새로운 텍스트 임베딩을 구성한다. 이러한 위치 태깅된 단어 가방 표현이 이미지 생성을 성공적으로 안내할 수 있으며, 시각적 품질과 텍스트 충실도에서 전체 텍스트 임베딩 기반 생성과 동등한 성능을 보임을 발견했다. 이는 일반적인 믿음과 달리, 텍스트-이미지 모델이 개별 단어 의미와 단어 순서를 넘어 텍스트 임베딩에 인코딩된 풍부한 정보를 자주 사용하지 않는다는 것을 보여준다. 대신, 복잡한 언어 구조의 해독은 이미지 모델 자체에 의해 수행된다. 프로젝트 웹페이지: https://nsping13.github.io/contextless-TTI/

English

Text-to-image models rely on text prompts as their primary interface to human intent. Prompts are encoded by a text encoder into embeddings that condition the image generation process. Beyond individual token meanings, text embeddings encode contextual information across the full prompt, such as compositionality and attribute binding. However, whether image models actually exploit this richer information remains underexplored. Here, we address the question: Which aspects of text representation are essential for image generation? We show that text-to-image diffusion transformer-based models commonly rely only on two relatively straightforward aspects of text representations: (i) the merging of adjacent tokens into a word representation, for words spanning multiple tokens, and (ii) word order, which is imprinted by the positional embedding of the text-encoder. To show this, we construct a new text embedding that encodes only individual word meanings and order but lacks any contextual information about the full prompt. We find that this bag of position-tagged words representation is sufficient to successfully guide image generation, achieving visual quality and text fidelity that are on par with full text embedding-guided generation. This demonstrates that, contrary to common belief, text-to-image models often do not use the rich information encoded in the text embedding beyond individual word meanings and word order. Instead, the decoding of complex linguistic structures is performed by the image model itself. Project webpage: https://nsping13.github.io/contextless-TTI/