Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action

Abstract

We present Unified-IO 2, the first autoregressive multimodal model capable of understanding and generating images, text, audio, and action. To unify the different modalities, we tokenize inputs and outputs (images, text, audio, action, bounding boxes, etc.) into a shared semantic space and then process them with a single encoder-decoder transformer model. Since training with such diverse modalities is challenging, we propose various architectural improvements to stabilize model training. We train our model from scratch on a large multimodal pre-training corpus from diverse sources with a multimodal mixture of denoisers objective. To learn an expansive set of skills, such as following multimodal instructions, we construct and fine-tune on an ensemble of 120 datasets with prompts and augmentations. With a single unified model, Unified-IO 2 achieves state-of-the-art performance on the GRIT benchmark and strong results on more than 35 benchmarks, including image generation and understanding, natural language understanding, video and audio understanding, and robotic manipulation. We release all our models to the research community.
