
Generative Pretraining in Multimodality

July 11, 2023
Authors: Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, Xinlong Wang
cs.AI

Abstract

We present Emu, a Transformer-based multimodal foundation model, which can seamlessly generate images and texts in multimodal context. This omnivore model can take in any single-modality or multimodal data input indiscriminately (e.g., interleaved image, text and video) through a one-model-for-all autoregressive training process. First, visual signals are encoded into embeddings, and together with text tokens form an interleaved input sequence. Emu is then end-to-end trained with a unified objective of classifying the next text token or regressing the next visual embedding in the multimodal sequence. This versatile multimodality empowers the exploration of diverse pretraining data sources at scale, such as videos with interleaved frames and text, webpages with interleaved images and text, as well as web-scale image-text pairs and video-text pairs. Emu can serve as a generalist multimodal interface for both image-to-text and text-to-image tasks, and supports in-context image and text generation. Across a broad range of zero-shot/few-shot tasks including image captioning, visual question answering, video question answering and text-to-image generation, Emu demonstrates superb performance compared to state-of-the-art large multimodal models. Extended capabilities such as multimodal assistants via instruction tuning are also demonstrated with impressive performance.
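The unified objective described above can be summarized as: predict the next element of an interleaved sequence, using classification when that element is a text token and regression when it is a visual embedding. Below is a minimal PyTorch-style sketch of such a loss, not the authors' implementation: the function name, the `is_text` mask, the use of MSE as the regression loss, and the `lambda_reg` weighting are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def unified_autoregressive_loss(hidden_states, text_logits, text_labels,
                                visual_targets, is_text, lambda_reg=1.0):
    """Sketch of a unified next-token objective over an interleaved sequence.

    hidden_states:  (B, T, D) transformer outputs (assumed to also serve as
                    predicted visual embeddings at visual positions)
    text_logits:    (B, T, V) language-model head outputs
    text_labels:    (B, T)    next-token ids, valid only where is_text is True
    visual_targets: (B, T, D) next visual embeddings, valid where is_text is False
    is_text:        (B, T)    bool mask: True where the next element is a text token
    """
    # Classification loss at positions whose next element is a text token.
    text_loss = F.cross_entropy(text_logits[is_text], text_labels[is_text])

    # Regression loss at positions whose next element is a visual embedding
    # (MSE used here as a stand-in regression loss).
    visual_loss = F.mse_loss(hidden_states[~is_text], visual_targets[~is_text])

    return text_loss + lambda_reg * visual_loss
```

Because both modalities are handled by one autoregressive pass, the same loss applies unchanged to any mix of image-text pairs, video frames with text, or interleaved webpages, which is what allows the diverse pretraining sources mentioned in the abstract.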