VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation
December 14, 2023
Authors: Jinguo Zhu, Xiaohan Ding, Yixiao Ge, Yuying Ge, Sijie Zhao, Hengshuang Zhao, Xiaohua Wang, Ying Shan
cs.AI
Abstract
In this work, we introduce Vision-Language Generative Pre-trained Transformer
(VL-GPT), a transformer model proficient at concurrently perceiving and
generating visual and linguistic data. VL-GPT achieves a unified pre-training
approach for both image and text modalities by employing a straightforward
auto-regressive objective, thereby enabling the model to process image and text
as seamlessly as a language model processes text. To accomplish this, we
initially propose a novel image tokenizer-detokenizer framework for visual
data, specifically designed to transform raw images into a sequence of
continuous embeddings and reconstruct them accordingly. In combination with the
existing text tokenizer and detokenizer, this framework allows for the encoding
of interleaved image-text data into a multimodal sequence, which can
subsequently be fed into the transformer model. Consequently, VL-GPT can
perform large-scale pre-training on multimodal corpora utilizing a unified
auto-regressive objective (i.e., next-token prediction). Upon completion of
pre-training, VL-GPT exhibits remarkable zero-shot and few-shot performance
across a diverse range of vision and language understanding and generation
tasks, including image captioning, visual question answering, text-to-image
generation, and more. Additionally, the pre-trained model retains in-context
learning capabilities when provided with multimodal prompts. We further conduct
instruction tuning on our VL-GPT, highlighting its exceptional potential for
multimodal assistance. The source code and model weights shall be released.
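
To make the formulation above concrete, the following is a minimal PyTorch sketch of the idea: an image tokenizer maps a raw image to a short sequence of continuous embeddings, these are interleaved with text-token embeddings into one multimodal sequence, and a causal transformer is trained with next-token prediction. All module names, sizes, and prediction heads here are hypothetical illustrations of the described approach, not the released VL-GPT implementation.

import torch
import torch.nn as nn

class ImageTokenizer(nn.Module):
    # Maps a raw image to a short sequence of continuous embeddings (hypothetical design).
    def __init__(self, num_tokens=32, dim=768):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.queries = nn.Parameter(torch.randn(num_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, image):                                        # (B, 3, 224, 224)
        feats = self.patch_embed(image).flatten(2).transpose(1, 2)   # (B, 196, dim)
        q = self.queries.unsqueeze(0).expand(image.size(0), -1, -1)
        tokens, _ = self.attn(q, feats, feats)                       # (B, num_tokens, dim)
        return tokens

class VLGPTSketch(nn.Module):
    # Causal transformer over an interleaved image-embedding / text-token sequence.
    def __init__(self, vocab_size=32000, dim=768, depth=4):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=depth)
        self.text_head = nn.Linear(dim, vocab_size)   # predicts the next text token
        self.image_head = nn.Linear(dim, dim)         # regresses the next image embedding

    def forward(self, image_tokens, text_ids):
        # One multimodal sequence: image embeddings followed by text embeddings.
        seq = torch.cat([image_tokens, self.text_embed(text_ids)], dim=1)
        # Causal mask so every position attends only to earlier positions.
        mask = torch.triu(torch.full((seq.size(1), seq.size(1)), float('-inf')), diagonal=1)
        hidden = self.backbone(seq, mask=mask)
        return self.text_head(hidden), self.image_head(hidden)

# Usage: next-token prediction on a caption conditioned on the image embeddings.
tokenizer, model = ImageTokenizer(), VLGPTSketch()
image = torch.randn(2, 3, 224, 224)
caption = torch.randint(0, 32000, (2, 16))
logits, _ = model(tokenizer(image), caption[:, :-1])
targets = caption[:, 1:]                               # shift targets by one position
loss = nn.functional.cross_entropy(
    logits[:, -targets.size(1):].reshape(-1, logits.size(-1)), targets.reshape(-1))
print(loss.item())

In this sketch only the text loss is shown; an analogous regression loss on the image head would supervise positions whose next element is an image embedding, mirroring the unified auto-regressive objective described in the abstract.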