
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action

December 28, 2023
作者: Jiasen Lu, Christopher Clark, Sangho Lee, Zichen Zhang, Savya Khosla, Ryan Marten, Derek Hoiem, Aniruddha Kembhavi
cs.AI

Abstract

We present Unified-IO 2, the first autoregressive multimodal model that is capable of understanding and generating image, text, audio, and action. To unify different modalities, we tokenize inputs and outputs -- images, text, audio, action, bounding boxes, etc., into a shared semantic space and then process them with a single encoder-decoder transformer model. Since training with such diverse modalities is challenging, we propose various architectural improvements to stabilize model training. We train our model from scratch on a large multimodal pre-training corpus from diverse sources with a multimodal mixture of denoisers objective. To learn an expansive set of skills, such as following multimodal instructions, we construct and finetune on an ensemble of 120 datasets with prompts and augmentations. With a single unified model, Unified-IO 2 achieves state-of-the-art performance on the GRIT benchmark and strong results in more than 35 benchmarks, including image generation and understanding, natural language understanding, video and audio understanding, and robotic manipulation. We release all our models to the research community.
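To make the core idea concrete, below is a minimal, hypothetical sketch (not the authors' released code) of what "tokenize each modality into a shared semantic space and process the result with a single encoder-decoder transformer" can look like. All class names, vocabulary sizes, and layer counts are illustrative assumptions; the actual Unified-IO 2 architecture differs in its tokenizers, objectives, and stabilization techniques.

```python
# Hypothetical sketch of a shared-token-space multimodal encoder-decoder.
# Not the Unified-IO 2 implementation; sizes and names are placeholders.
import torch
import torch.nn as nn


class SharedSpaceMultimodalModel(nn.Module):
    def __init__(self, text_vocab=32000, image_codes=8192, audio_codes=8192, d_model=512):
        super().__init__()
        # In practice, modality-specific tokenizers (e.g. BPE for text,
        # VQ codebooks for image/audio patches) produce discrete ids;
        # each id type is embedded into the same d_model-dimensional space.
        self.text_embed = nn.Embedding(text_vocab, d_model)
        self.image_embed = nn.Embedding(image_codes, d_model)
        self.audio_embed = nn.Embedding(audio_codes, d_model)
        # One encoder-decoder transformer handles the combined sequence.
        self.backbone = nn.Transformer(
            d_model=d_model, nhead=8,
            num_encoder_layers=6, num_decoder_layers=6,
            batch_first=True,
        )
        # Project decoder states back to a token vocabulary (text here).
        self.lm_head = nn.Linear(d_model, text_vocab)

    def forward(self, text_ids, image_ids, audio_ids, target_ids):
        # Embed each modality and concatenate into one input sequence.
        src = torch.cat(
            [self.text_embed(text_ids),
             self.image_embed(image_ids),
             self.audio_embed(audio_ids)], dim=1)
        tgt = self.text_embed(target_ids)   # autoregressive decoder input
        hidden = self.backbone(src, tgt)
        return self.lm_head(hidden)         # next-token logits


# Toy usage with random token ids (batch of 2).
model = SharedSpaceMultimodalModel()
logits = model(
    text_ids=torch.randint(0, 32000, (2, 16)),
    image_ids=torch.randint(0, 8192, (2, 64)),
    audio_ids=torch.randint(0, 8192, (2, 32)),
    target_ids=torch.randint(0, 32000, (2, 8)),
)
print(logits.shape)  # torch.Size([2, 8, 32000])
```

The design point this sketch illustrates is that once every modality is expressed as discrete tokens embedded in one space, a single sequence model can both condition on and generate any mixture of them; the paper's contributions lie in making that training stable and scalable, which this toy example does not attempt to capture.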