De-Diffusion Makes Text a Strong Cross-Modal Interface

November 1, 2023
Authors: Chen Wei, Chenxi Liu, Siyuan Qiao, Zhishuai Zhang, Alan Yuille, Jiahui Yu
cs.AI

Abstract

We demonstrate text as a strong cross-modal interface. Rather than relying on deep embeddings to connect image and language as the interface representation, our approach represents an image as text, from which we enjoy the interpretability and flexibility inherent to natural language. We employ an autoencoder that uses a pre-trained text-to-image diffusion model for decoding. The encoder is trained to transform an input image into text, which is then fed into the fixed text-to-image diffusion decoder to reconstruct the original input -- a process we term De-Diffusion. Experiments validate both the precision and comprehensiveness of De-Diffusion text representing images, such that it can be readily ingested by off-the-shelf text-to-image tools and LLMs for diverse multi-modal tasks. For example, a single De-Diffusion model can generalize to provide transferable prompts for different text-to-image tools, and also achieves a new state of the art on open-ended vision-language tasks by simply prompting large language models with few-shot examples.
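To make the autoencoding setup in the abstract concrete, below is a minimal PyTorch-style sketch of the De-Diffusion training loop: an image-to-text encoder is optimized so that a frozen text-to-image diffusion decoder can reconstruct the input image from the predicted text. All names here (`ImageToTextEncoder`, `diffusion_decoder.denoising_loss`, the token count, the Gumbel-softmax relaxation) are illustrative assumptions for exposition, not the paper's actual implementation or API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ImageToTextEncoder(nn.Module):
    """Hypothetical encoder mapping an image to a sequence of soft text tokens."""

    def __init__(self, vocab_size: int, num_tokens: int = 75, dim: int = 768):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.to_logits = nn.Linear(dim, vocab_size)
        self.num_tokens = num_tokens

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.patchify(images).flatten(2).transpose(1, 2)   # (B, patches, dim)
        logits = self.to_logits(feats[:, : self.num_tokens])       # (B, T, vocab)
        # Discrete-but-differentiable token choice (straight-through Gumbel-softmax)
        # so the reconstruction gradient can flow back through the text interface.
        return F.gumbel_softmax(logits, tau=1.0, hard=True)


def train_step(encoder, diffusion_decoder, images, optimizer):
    """One autoencoding step: image -> text tokens -> frozen diffusion decoder -> loss.

    The decoder's weights stay fixed; only the encoder is updated from the
    reconstruction (denoising) objective.
    """
    tokens = encoder(images)
    # Assumed interface: the frozen decoder scores how well the predicted text
    # conditions a denoising reconstruction of the original images.
    loss = diffusion_decoder.denoising_loss(text_tokens=tokens, targets=images)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Usage sketch: freeze the pre-trained decoder, optimize only the encoder.
# for p in diffusion_decoder.parameters():
#     p.requires_grad_(False)
# optimizer = torch.optim.AdamW(encoder.parameters(), lr=1e-4)
```

Because the interface between encoder and decoder is plain text rather than a deep embedding, the trained encoder's output can, as the abstract notes, be handed directly to other text-to-image tools or used as image descriptions when few-shot prompting an LLM.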