Generative Pretraining in Multimodality

Abstract

We present Emu, a Transformer-based multimodal foundation model that can seamlessly generate images and text in a multimodal context. This omnivore model can take in any single-modality or multimodal data input indiscriminately (e.g., interleaved image, text and video) through a one-model-for-all autoregressive training process. First, visual signals are encoded into embeddings, which together with text tokens form an interleaved input sequence. Emu is then trained end-to-end with a unified objective of classifying the next text token or regressing the next visual embedding in the multimodal sequence. This versatile multimodality enables the exploration of diverse pretraining data sources at scale, such as videos with interleaved frames and text, webpages with interleaved images and text, and web-scale image-text pairs and video-text pairs. Emu can serve as a generalist multimodal interface for both image-to-text and text-to-image tasks, and supports in-context image and text generation. Across a broad range of zero-shot/few-shot tasks including image captioning, visual question answering, video question answering and text-to-image generation, Emu demonstrates superb performance compared to state-of-the-art large multimodal models. Extended capabilities such as multimodal assistants via instruction tuning are also demonstrated with impressive performance.
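The unified objective described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the classification term is a cross-entropy over a text vocabulary, that the regression term is a squared error between predicted and target visual embeddings, and that the two terms are weighted equally; the abstract specifies none of these choices. The function name `unified_loss` and all argument names are hypothetical.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the last axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def unified_loss(text_logits, visual_preds, target_ids, target_embeds, is_visual):
    """Mean per-position loss over one interleaved sequence (hypothetical helper).

    text_logits:   (T, V) scores over a vocabulary of size V
    visual_preds:  (T, D) predicted visual embeddings
    target_ids:    (T,)   next text token ids (ignored at visual positions)
    target_embeds: (T, D) next visual embeddings (ignored at text positions)
    is_visual:     (T,)   True where the next element is a visual embedding
    """
    target_ids = np.asarray(target_ids)
    is_visual = np.asarray(is_visual, dtype=bool)
    # Classification term: cross-entropy on the next text token (assumed form).
    logp = log_softmax(np.asarray(text_logits, dtype=float))
    ce = -logp[np.arange(len(target_ids)), target_ids]
    # Regression term: squared error on the next visual embedding (assumed form).
    diff = np.asarray(visual_preds, dtype=float) - np.asarray(target_embeds, dtype=float)
    l2 = (diff ** 2).sum(axis=-1)
    # Each position contributes exactly one of the two terms, equally weighted.
    return np.where(is_visual, l2, ce).mean()

# Toy sequence: position 0 predicts a text token, position 1 a visual embedding.
loss = unified_loss(
    text_logits=np.zeros((2, 4)),
    visual_preds=np.zeros((2, 3)),
    target_ids=np.array([1, 0]),
    target_embeds=np.ones((2, 3)),
    is_visual=np.array([False, True]),
)
```

In this toy case the text position contributes the cross-entropy of a uniform distribution over four tokens, and the visual position contributes the squared distance between a zero vector and a vector of ones. A real model would produce both outputs from a shared Transformer state; the sketch only shows how a single mask can route each position to the appropriate loss term.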
