Generating Images with Multimodal Language Models

Abstract

We propose a method to fuse frozen text-only large language models (LLMs) with pre-trained image encoder and decoder models by mapping between their embedding spaces. Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue. Ours is the first approach capable of conditioning on arbitrarily interleaved image and text inputs to generate coherent image (and text) outputs. To achieve strong performance on image generation, we propose an efficient mapping network to ground the LLM to an off-the-shelf text-to-image generation model. This mapping network translates hidden representations of text into the embedding space of the visual models, enabling us to leverage the strong text representations of the LLM for visual outputs. Our approach outperforms baseline generation models on tasks with longer and more complex language. In addition to novel image generation, our model is also capable of image retrieval from a prespecified dataset, and it decides whether to retrieve or generate at inference time. This is done with a learnt decision module that conditions on the hidden representations of the LLM. Our model exhibits a wider range of capabilities than prior multimodal language models. It can process image-and-text inputs and produce retrieved images, generated images, and generated text, and it outperforms non-LLM-based generation models across several text-to-image tasks that measure context dependence.
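
The abstract names two learned components that sit on top of the frozen LLM: a mapping network that projects hidden representations into the visual model's embedding space, and a decision module that chooses between retrieval and generation. The toy sketch below illustrates only that data flow. It is not the paper's architecture: a single linear layer stands in for the mapping network, a logistic score stands in for the decision module, the weights are random rather than trained, and every name and dimension (`map_to_visual_space`, `decide`, `LLM_DIM`, `VISUAL_DIM`) is a hypothetical choice made for illustration.

```python
import math
import random


def linear(x, weight, bias):
    """Apply y = W x + b to a single vector x (plain lists of floats)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def init_linear(in_dim, out_dim, rng):
    """Random weights and zero bias; a real model would learn these (assumption)."""
    weight = [[rng.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def map_to_visual_space(hidden, mapper):
    """Project an LLM hidden state into the image model's embedding space."""
    weight, bias = mapper
    return linear(hidden, weight, bias)


def decide(hidden, decider, threshold=0.5):
    """Return 'generate' or 'retrieve' from a logistic score on the hidden state."""
    weight, bias = decider
    logit = linear(hidden, weight, bias)[0]
    p_generate = 1.0 / (1.0 + math.exp(-logit))
    return "generate" if p_generate >= threshold else "retrieve"


rng = random.Random(0)
LLM_DIM, VISUAL_DIM = 8, 4  # toy sizes chosen for the example, not the paper's

mapper = init_linear(LLM_DIM, VISUAL_DIM, rng)
decider = init_linear(LLM_DIM, 1, rng)

# Stand-in for a hidden representation produced by the frozen LLM.
hidden_state = [rng.gauss(0.0, 1.0) for _ in range(LLM_DIM)]

visual_embedding = map_to_visual_space(hidden_state, mapper)
action = decide(hidden_state, decider)
print(len(visual_embedding), action)
```

In this sketch the LLM itself is never touched; only `mapper` and `decider` would carry trainable parameters, which mirrors the abstract's description of a frozen language model with small learned modules attached.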
