

De-Diffusion Makes Text a Strong Cross-Modal Interface

November 1, 2023
Authors: Chen Wei, Chenxi Liu, Siyuan Qiao, Zhishuai Zhang, Alan Yuille, Jiahui Yu
cs.AI

Abstract

We demonstrate text as a strong cross-modal interface. Rather than relying on deep embeddings to connect image and language as the interface representation, our approach represents an image as text, from which we enjoy the interpretability and flexibility inherent to natural language. We employ an autoencoder that uses a pre-trained text-to-image diffusion model for decoding. The encoder is trained to transform an input image into text, which is then fed into the fixed text-to-image diffusion decoder to reconstruct the original input -- a process we term De-Diffusion. Experiments validate both the precision and comprehensiveness of De-Diffusion text representing images, such that it can be readily ingested by off-the-shelf text-to-image tools and LLMs for diverse multi-modal tasks. For example, a single De-Diffusion model can generalize to provide transferable prompts for different text-to-image tools, and also achieves a new state of the art on open-ended vision-language tasks by simply prompting large language models with few-shot examples.
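The abstract describes an autoencoder in which only the image-to-text encoder is trained, while a pre-trained text-to-image diffusion model serves as a frozen decoder, and the reconstruction loss is back-propagated through the discrete text bottleneck. Below is a minimal, self-contained sketch of that setup. All module names, sizes, and the Gumbel-softmax relaxation used to keep the text tokens differentiable are illustrative assumptions for this sketch, not the paper's actual implementation.

```python
# Sketch of the De-Diffusion training setup described in the abstract:
# an image-to-text encoder is optimized so that a FROZEN text-to-image
# diffusion decoder can reconstruct the input image from the predicted text.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 1000   # assumed text-token vocabulary size
SEQ_LEN = 16   # assumed length of the generated "caption"
DIM = 128

class ImageToTextEncoder(nn.Module):
    """Maps an image to a sequence of logits over discrete text tokens."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, DIM, 8, stride=8), nn.GELU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.to_tokens = nn.Linear(DIM, SEQ_LEN * VOCAB)

    def forward(self, img):
        logits = self.to_tokens(self.backbone(img))
        return logits.view(-1, SEQ_LEN, VOCAB)

class FrozenTextToImageDecoder(nn.Module):
    """Toy stand-in for a pre-trained text-conditioned diffusion model:
    it predicts the noise added to a noisy image, conditioned on text."""
    def __init__(self):
        super().__init__()
        self.text_embed = nn.Linear(VOCAB, DIM)  # consumes (soft) one-hot tokens
        self.denoise = nn.Conv2d(3 + 1, 3, 3, padding=1)

    def forward(self, noisy_img, text_onehot):
        cond = self.text_embed(text_onehot).mean(dim=1)           # (B, DIM)
        cond_map = cond.mean(-1, keepdim=True)[..., None, None]   # (B,1,1,1)
        cond_map = cond_map.expand(-1, 1, *noisy_img.shape[2:])
        return self.denoise(torch.cat([noisy_img, cond_map], dim=1))

encoder = ImageToTextEncoder()
decoder = FrozenTextToImageDecoder()
for p in decoder.parameters():   # the decoder stays fixed, per the abstract
    p.requires_grad_(False)

opt = torch.optim.Adam(encoder.parameters(), lr=1e-4)

img = torch.randn(4, 3, 64, 64)  # toy batch
logits = encoder(img)
# Gumbel-softmax keeps the tokens (near-)discrete while remaining
# differentiable, so gradients flow from the frozen decoder into the encoder.
text = F.gumbel_softmax(logits, tau=1.0, hard=True)

# One step of a standard diffusion-style reconstruction objective: add noise,
# then ask the frozen decoder to predict it given the generated text.
noise = torch.randn_like(img)
noisy = img + noise              # a real model would follow a noise schedule
pred = decoder(noisy, text)
loss = F.mse_loss(pred, noise)
loss.backward()
opt.step()

print("generated token ids:", logits.argmax(-1)[0].tolist())
```

Because the bottleneck is plain text rather than a deep embedding, the encoder's output can be decoded by any text consumer: the same token sequence can be handed to a different text-to-image tool as a prompt, or placed into an LLM's context for few-shot vision-language tasks, which is the transferability the abstract claims.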