Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action
December 28, 2023
Authors: Jiasen Lu, Christopher Clark, Sangho Lee, Zichen Zhang, Savya Khosla, Ryan Marten, Derek Hoiem, Aniruddha Kembhavi
cs.AI
Abstract
We present Unified-IO 2, the first autoregressive multimodal model that is
capable of understanding and generating image, text, audio, and action. To
unify different modalities, we tokenize inputs and outputs -- images, text,
audio, action, bounding boxes, etc., into a shared semantic space and then
process them with a single encoder-decoder transformer model. Since training
with such diverse modalities is challenging, we propose various architectural
improvements to stabilize model training. We train our model from scratch on a
large multimodal pre-training corpus from diverse sources with a multimodal
mixture of denoisers objective. To learn an expansive set of skills, such as
following multimodal instructions, we construct and finetune on an ensemble of
120 datasets with prompts and augmentations. With a single unified model,
Unified-IO 2 achieves state-of-the-art performance on the GRIT benchmark and
strong results in more than 35 benchmarks, including image generation and
understanding, natural language understanding, video and audio understanding,
and robotic manipulation. We release all our models to the research community.
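The abstract describes tokenizing inputs and outputs from every modality into one shared semantic space and processing them with a single encoder-decoder transformer. The following is a minimal sketch of that idea, not the authors' implementation: the vocabulary sizes, the disjoint-range layout, and the names (to_shared_ids, UnifiedSeq2Seq) are illustrative assumptions.

```python
# Sketch of a shared token space: each modality's discrete tokens occupy a
# disjoint slice of one vocabulary, and a single encoder-decoder transformer
# consumes and produces tokens from that vocabulary. Sizes are placeholders.
import torch
import torch.nn as nn

TEXT_VOCAB, IMAGE_CODES, AUDIO_CODES, ACTION_BINS = 32000, 16384, 8192, 256
VOCAB = TEXT_VOCAB + IMAGE_CODES + AUDIO_CODES + ACTION_BINS  # shared space

def to_shared_ids(ids: torch.Tensor, modality: str) -> torch.Tensor:
    """Shift modality-local token ids into that modality's slice of the shared vocabulary."""
    offset = {"text": 0,
              "image": TEXT_VOCAB,
              "audio": TEXT_VOCAB + IMAGE_CODES,
              "action": TEXT_VOCAB + IMAGE_CODES + AUDIO_CODES}[modality]
    return ids + offset

class UnifiedSeq2Seq(nn.Module):
    """One embedding table and one encoder-decoder transformer over all modalities."""
    def __init__(self, d_model: int = 512):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        self.transformer = nn.Transformer(d_model=d_model, batch_first=True)
        self.lm_head = nn.Linear(d_model, VOCAB)

    def forward(self, src_ids: torch.Tensor, tgt_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.transformer(self.embed(src_ids), self.embed(tgt_ids))
        return self.lm_head(hidden)  # logits over the shared vocabulary

# Usage example: a text prompt in, image-code tokens out (ids are random placeholders).
model = UnifiedSeq2Seq()
prompt = to_shared_ids(torch.randint(0, TEXT_VOCAB, (1, 12)), "text")
target = to_shared_ids(torch.randint(0, IMAGE_CODES, (1, 20)), "image")
logits = model(prompt, target)  # shape: (1, 20, VOCAB)
```

Because every output, whether text, image codes, audio codes, or discretized actions, is drawn from the same vocabulary, one autoregressive decoder and one training objective can cover all of the tasks the abstract lists.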