マルチモーダル言語モデルによる画像生成

要旨

我々は、凍結されたテキスト専用の大規模言語モデル（LLM）と事前学習済みの画像エンコーダおよびデコーダモデルを、それらの埋め込み空間間のマッピングによって融合する手法を提案する。本モデルは、画像検索、新規画像生成、マルチモーダル対話など、幅広いマルチモーダル能力を実証する。我々のアプローチは、任意にインターリーブされた画像とテキスト入力を条件として、一貫性のある画像（およびテキスト）出力を生成する初めての手法である。画像生成において強力な性能を達成するために、LLMを既存のテキストから画像生成モデルに接続する効率的なマッピングネットワークを提案する。このマッピングネットワークは、テキストの隠れ表現を視覚モデルの埋め込み空間に変換し、LLMの強力なテキスト表現を視覚出力に活用することを可能にする。我々のアプローチは、より長く複雑な言語タスクにおいて、ベースライン生成モデルを上回る性能を示す。新規画像生成に加えて、本モデルは事前に指定されたデータセットからの画像検索も可能であり、推論時に検索するか生成するかを決定する。これは、LLMの隠れ表現を条件とする学習済みの決定モジュールによって行われる。本モデルは、従来のマルチモーダル言語モデルと比較して、より広範な能力を有する。画像とテキスト入力を処理し、検索された画像、生成された画像、および生成されたテキストを出力することができ、文脈依存性を測定するいくつかのテキストから画像タスクにおいて、非LLMベースの生成モデルを上回る性能を示す。

English

We propose a method to fuse frozen text-only large language models (LLMs) with pre-trained image encoder and decoder models, by mapping between their embedding spaces. Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue. Ours is the first approach capable of conditioning on arbitrarily interleaved image and text inputs to generate coherent image (and text) outputs. To achieve strong performance on image generation, we propose an efficient mapping network to ground the LLM to an off-the-shelf text-to-image generation model. This mapping network translates hidden representations of text into the embedding space of the visual models, enabling us to leverage the strong text representations of the LLM for visual outputs. Our approach outperforms baseline generation models on tasks with longer and more complex language. In addition to novel image generation, our model is also capable of image retrieval from a prespecified dataset, and decides whether to retrieve or generate at inference time. This is done with a learnt decision module which conditions on the hidden representations of the LLM. Our model exhibits a wider range of capabilities compared to prior multimodal language models. It can process image-and-text inputs, and produce retrieved images, generated images, and generated text -- outperforming non-LLM based generation models across several text-to-image tasks that measure context dependence.

マルチモーダル言語モデルによる画像生成

Generating Images with Multimodal Language Models

要旨

Support