

Generating Images with Multimodal Language Models

May 26, 2023
Authors: Jing Yu Koh, Daniel Fried, Ruslan Salakhutdinov
cs.AI

Abstract

We propose a method to fuse frozen text-only large language models (LLMs) with pre-trained image encoder and decoder models, by mapping between their embedding spaces. Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue. Ours is the first approach capable of conditioning on arbitrarily interleaved image and text inputs to generate coherent image (and text) outputs. To achieve strong performance on image generation, we propose an efficient mapping network to ground the LLM to an off-the-shelf text-to-image generation model. This mapping network translates hidden representations of text into the embedding space of the visual models, enabling us to leverage the strong text representations of the LLM for visual outputs. Our approach outperforms baseline generation models on tasks with longer and more complex language. In addition to novel image generation, our model is also capable of image retrieval from a prespecified dataset, and decides whether to retrieve or generate at inference time. This is done with a learnt decision module which conditions on the hidden representations of the LLM. Our model exhibits a wider range of capabilities compared to prior multimodal language models. It can process image-and-text inputs, and produce retrieved images, generated images, and generated text -- outperforming non-LLM based generation models across several text-to-image tasks that measure context dependence.
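Below is a minimal sketch, in PyTorch, of the two lightweight trainable components the abstract describes: a mapping network that projects the frozen LLM's hidden representations into the embedding space of a frozen text-to-image generator, and a learned decision module that chooses between retrieval and generation at inference time. The class names, dimensions, and layer choices are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """Projects frozen-LLM hidden states into the embedding space of a
    frozen text-to-image generator (hypothetical architecture)."""
    def __init__(self, llm_dim: int, visual_dim: int, n_visual_tokens: int):
        super().__init__()
        self.n_visual_tokens = n_visual_tokens
        self.proj = nn.Sequential(
            nn.Linear(llm_dim, visual_dim),
            nn.GELU(),
            nn.Linear(visual_dim, visual_dim * n_visual_tokens),
        )

    def forward(self, llm_hidden: torch.Tensor) -> torch.Tensor:
        # llm_hidden: (batch, llm_dim), e.g. hidden states at special image tokens.
        out = self.proj(llm_hidden)
        # Reshape to (batch, n_visual_tokens, visual_dim) as conditioning
        # embeddings for the frozen image decoder.
        return out.view(-1, self.n_visual_tokens, out.shape[-1] // self.n_visual_tokens)

class RetrieveOrGenerate(nn.Module):
    """Learned decision module: retrieve from a prespecified dataset
    or generate a novel image, conditioned on the LLM hidden state."""
    def __init__(self, llm_dim: int):
        super().__init__()
        self.classifier = nn.Linear(llm_dim, 2)  # logits for [retrieve, generate]

    def forward(self, llm_hidden: torch.Tensor) -> torch.Tensor:
        return self.classifier(llm_hidden)

# Usage sketch with placeholder dimensions (assumed, not from the paper).
llm_dim, visual_dim, n_visual_tokens = 4096, 768, 4
mapper = MappingNetwork(llm_dim, visual_dim, n_visual_tokens)
decider = RetrieveOrGenerate(llm_dim)

hidden = torch.randn(1, llm_dim)           # stand-in for a frozen-LLM hidden state
decision = decider(hidden).argmax(dim=-1)  # 0 = retrieve, 1 = generate
visual_embeds = mapper(hidden)             # conditioning for the frozen text-to-image model
```

Both modules take only the LLM's hidden representations as input, which is consistent with the abstract's point that the LLM, image encoder, and image decoder all remain frozen while only small mapping components are trained.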