VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation

Abstract

In this work, we introduce Vision-Language Generative Pre-trained Transformer (VL-GPT), a transformer model proficient at concurrently perceiving and generating visual and linguistic data. VL-GPT achieves a unified pre-training approach for both image and text modalities by employing a straightforward auto-regressive objective, enabling the model to process images and text as seamlessly as a language model processes text. To accomplish this, we first propose a novel image tokenizer-detokenizer framework for visual data, specifically designed to transform raw images into a sequence of continuous embeddings and to reconstruct images from those embeddings. Combined with the existing text tokenizer and detokenizer, this framework encodes interleaved image-text data into a multimodal sequence that can be fed into the transformer model. Consequently, VL-GPT can perform large-scale pre-training on multimodal corpora using a unified auto-regressive objective (i.e., next-token prediction). After pre-training, VL-GPT exhibits remarkable zero-shot and few-shot performance across a diverse range of vision and language understanding and generation tasks, including image captioning, visual question answering, and text-to-image generation. Additionally, the pre-trained model retains in-context learning capabilities when provided with multimodal prompts. We further conduct instruction tuning on VL-GPT, highlighting its exceptional potential for multimodal assistance. The source code and model weights will be released.

