Generative Pretraining in Multimodality
July 11, 2023
Authors: Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, Xinlong Wang
cs.AI
Abstract
We present Emu, a Transformer-based multimodal foundation model, which can
seamlessly generate images and text in a multimodal context. This omnivore model
can take in any single-modality or multimodal data input indiscriminately
(e.g., interleaved images, text, and video) through a one-model-for-all
autoregressive training process. First, visual signals are encoded into
embeddings, and together with text tokens form an interleaved input sequence.
Emu is then end-to-end trained with a unified objective of classifying the next
text token or regressing the next visual embedding in the multimodal sequence.
This versatile multimodality empowers the exploration of diverse pretraining
data sources at scale, such as videos with interleaved frames and text,
webpages with interleaved images and text, as well as web-scale image-text
pairs and video-text pairs. Emu can serve as a generalist multimodal interface
for both image-to-text and text-to-image tasks, and supports in-context image
and text generation. Across a broad range of zero-shot/few-shot tasks including
image captioning, visual question answering, video question answering and
text-to-image generation, Emu demonstrates superb performance compared to
state-of-the-art large multimodal models. Extended capabilities such as
multimodal assistants via instruction tuning are also demonstrated with
impressive performance.
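
The unified autoregressive objective described above (next-text-token classification plus next-visual-embedding regression over one interleaved sequence) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the backbone, head names (`text_head`, `visual_head`), dimensions, and the choice of an l2 regression loss are all assumptions made here for clarity.

```python
# Minimal sketch of a unified autoregressive objective over an interleaved
# multimodal sequence: classify the next text token at text positions and
# regress the next visual embedding at visual positions.
# All module names, sizes, and the l2 regression loss are illustrative
# assumptions, not the Emu authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

D_MODEL, VOCAB = 512, 32000              # hypothetical width / vocabulary size

backbone = nn.TransformerEncoder(        # stand-in for the causal Transformer
    nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=8, batch_first=True),
    num_layers=2,
)
text_head = nn.Linear(D_MODEL, VOCAB)      # classification head (text tokens)
visual_head = nn.Linear(D_MODEL, D_MODEL)  # regression head (visual embeddings)


def unified_loss(inputs, text_targets, visual_targets, next_is_text):
    """inputs:         (B, L, D) interleaved text-token + visual embeddings
    text_targets:      (B, L)    id of the next text token (used where next_is_text)
    visual_targets:    (B, L, D) next visual embedding (used elsewhere)
    next_is_text:      (B, L)    bool, True where the next element is a text token
    """
    L = inputs.size(1)
    causal = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)
    h = backbone(inputs, mask=causal)                  # causal self-attention

    # Cross-entropy on positions whose next element is a text token.
    loss_txt = F.cross_entropy(text_head(h[next_is_text]),
                               text_targets[next_is_text])
    # l2 regression on positions whose next element is a visual embedding.
    loss_vis = F.mse_loss(visual_head(h[~next_is_text]),
                          visual_targets[~next_is_text])
    return loss_txt + loss_vis


# Toy usage on random data with alternating text/visual positions.
B, L = 2, 8
inputs = torch.randn(B, L, D_MODEL)
text_targets = torch.randint(0, VOCAB, (B, L))
visual_targets = torch.randn(B, L, D_MODEL)
next_is_text = (torch.arange(L) % 2 == 0).unsqueeze(0).expand(B, L)
print(unified_loss(inputs, text_targets, visual_targets, next_is_text))
```

In the full model the visual embeddings would come from a visual encoder and the regressed embeddings would be turned back into images by a separate decoder; this sketch only shows how a single loss can mix classification and regression over one interleaved sequence.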