

VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation

December 14, 2023
Authors: Jinguo Zhu, Xiaohan Ding, Yixiao Ge, Yuying Ge, Sijie Zhao, Hengshuang Zhao, Xiaohua Wang, Ying Shan
cs.AI

Abstract

In this work, we introduce Vision-Language Generative Pre-trained Transformer (VL-GPT), a transformer model proficient at concurrently perceiving and generating visual and linguistic data. VL-GPT achieves a unified pre-training approach for both image and text modalities by employing a straightforward auto-regressive objective, thereby enabling the model to process image and text as seamlessly as a language model processes text. To accomplish this, we first propose a novel image tokenizer-detokenizer framework for visual data, specifically designed to transform raw images into a sequence of continuous embeddings and reconstruct them accordingly. In combination with the existing text tokenizer and detokenizer, this framework allows for the encoding of interleaved image-text data into a multimodal sequence, which can subsequently be fed into the transformer model. Consequently, VL-GPT can perform large-scale pre-training on multimodal corpora utilizing a unified auto-regressive objective (i.e., next-token prediction). Upon completion of pre-training, VL-GPT exhibits remarkable zero-shot and few-shot performance across a diverse range of vision and language understanding and generation tasks, including image captioning, visual question answering, text-to-image generation, and more. Additionally, the pre-trained model retains in-context learning capabilities when provided with multimodal prompts. We further conduct instruction tuning on our VL-GPT, highlighting its exceptional potential for multimodal assistance. The source code and model weights will be released.
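
To make the unified auto-regressive setup described above concrete, the following minimal PyTorch sketch shows how interleaved image-text data could be encoded into a single multimodal sequence of discrete text tokens and continuous image embeddings, then trained with a next-token objective. This is an illustrative assumption, not the authors' released implementation: the module names (ImageTokenizer, VLGPTSketch, ar_loss), the dimensions, and the MSE regression loss on image-embedding positions are all placeholder choices.

# Minimal sketch (assumed, not VL-GPT's actual code): interleaved image-text
# data encoded into one multimodal sequence, trained with a unified
# next-token objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

D_MODEL = 1024       # shared embedding width (assumed)
N_IMG_TOKENS = 32    # continuous embeddings per image (assumed)
VOCAB_SIZE = 32000   # text vocabulary size (assumed)

class ImageTokenizer(nn.Module):
    """Maps a raw image to a short sequence of continuous embeddings."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Conv2d(3, D_MODEL, kernel_size=32, stride=32)  # stand-in visual encoder
        self.pool = nn.AdaptiveAvgPool1d(N_IMG_TOKENS)

    def forward(self, image):                          # image: (B, 3, H, W)
        feats = self.encoder(image).flatten(2)         # (B, D_MODEL, H/32 * W/32)
        return self.pool(feats).transpose(1, 2)        # (B, N_IMG_TOKENS, D_MODEL)

class VLGPTSketch(nn.Module):
    """Causal transformer over interleaved text tokens and image embeddings."""
    def __init__(self):
        super().__init__()
        self.text_embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
        self.image_tokenizer = ImageTokenizer()
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=4)
        self.text_head = nn.Linear(D_MODEL, VOCAB_SIZE)  # predicts the next text token
        self.image_head = nn.Linear(D_MODEL, D_MODEL)    # regresses the next image embedding

    def forward(self, text_ids, image):
        txt = self.text_embed(text_ids)                  # (B, T, D)
        img = self.image_tokenizer(image)                # (B, N, D)
        seq = torch.cat([txt, img], dim=1)               # one interleaved multimodal sequence
        causal = torch.triu(torch.ones(seq.size(1), seq.size(1),
                                       dtype=torch.bool, device=seq.device), diagonal=1)
        return self.transformer(seq, mask=causal), txt.size(1)

def ar_loss(model, text_ids, image):
    """Unified auto-regressive loss: cross-entropy on text positions,
    regression on continuous image-embedding positions."""
    hidden, t_len = model(text_ids, image)
    text_logits = model.text_head(hidden[:, : t_len - 1])             # position i predicts token i+1
    loss_text = F.cross_entropy(text_logits.reshape(-1, VOCAB_SIZE),
                                text_ids[:, 1:].reshape(-1))
    img_target = model.image_tokenizer(image)                         # ground-truth image embeddings
    img_pred = model.image_head(hidden[:, t_len - 1 : -1])            # shifted by one position
    loss_image = F.mse_loss(img_pred, img_target)
    return loss_text + loss_image

In use, a training step would reduce to something like loss = ar_loss(VLGPTSketch(), token_ids, images) inside a standard loop. The point of the sketch is that image positions carry continuous embeddings, so a single causal transformer and next-token objective cover both modalities; a detokenizer (omitted here) would reconstruct images from the predicted embeddings.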