De-Diffusion Makes Text a Strong Cross-Modal Interface

Abstract

We demonstrate that text is a strong cross-modal interface. Rather than relying on deep embeddings as the interface representation that connects image and language, our approach represents an image as text, which lets us benefit from the interpretability and flexibility inherent to natural language. We employ an autoencoder that uses a pre-trained text-to-image diffusion model for decoding. The encoder is trained to transform an input image into text, which is then fed into the fixed text-to-image diffusion decoder to reconstruct the original input. We term this process De-Diffusion. Experiments validate both the precision and the comprehensiveness of De-Diffusion text as a representation of images, so that it can be readily ingested by off-the-shelf text-to-image tools and large language models (LLMs) for diverse multi-modal tasks. For example, a single De-Diffusion model can generalize to provide transferable prompts for different text-to-image tools, and it also achieves a new state of the art on open-ended vision-language tasks by simply prompting LLMs with few-shot examples.
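The data flow described in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `toy_encoder`, `toy_frozen_decoder`, `round_trip`, `mse`, and `few_shot_prompt` are hypothetical names, the "image" is just a list of intensities, and no real diffusion model, training loop, or LLM is involved. The sketch only shows the shape of the idea: an encoder maps an image to text, a fixed decoder maps that text back to an image, the reconstruction error measures how much the text preserved, and the same text can be dropped into a few-shot prompt.

```python
from typing import Callable, List, Sequence, Tuple

Image = List[float]  # toy stand-in for pixel intensities in [0, 1]


def toy_encoder(image: Image) -> str:
    """Hypothetical encoder: describe each pixel with one word."""
    return " ".join("bright" if p >= 0.5 else "dark" for p in image)


def toy_frozen_decoder(text: str) -> Image:
    """Hypothetical fixed decoder: map each word back to an intensity."""
    return [1.0 if word == "bright" else 0.0 for word in text.split()]


def round_trip(
    image: Image,
    encoder: Callable[[Image], str],
    decoder: Callable[[str], Image],
) -> Tuple[str, Image]:
    """Encode an image to text, then decode the text back to an image."""
    text = encoder(image)
    return text, decoder(text)


def mse(a: Image, b: Image) -> float:
    """Mean squared error between two equally sized toy images."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def few_shot_prompt(examples: Sequence[Tuple[str, str]], query_text: str) -> str:
    """Build a plain-text few-shot prompt from (image text, answer) pairs."""
    blocks = [f"Image: {t}\nAnswer: {a}" for t, a in examples]
    blocks.append(f"Image: {query_text}\nAnswer:")
    return "\n\n".join(blocks)


image = [0.9, 0.1, 0.7]
text, recon = round_trip(image, toy_encoder, toy_frozen_decoder)
print(text)                 # the text bottleneck
print(mse(image, recon))    # how much the text failed to preserve
print(few_shot_prompt([("dark dark bright", "mostly dark")], text))
```

In this toy setting only the encoder would be adjusted to lower the reconstruction error, while the decoder stays untouched, mirroring the fixed decoder described above.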
