Generative Pretraining in Multimodality

Abstract

We present Emu, a Transformer-based multimodal foundation model that can seamlessly generate images and text in a multimodal context. This omnivore model can take in any single-modality or multimodal data input indiscriminately (e.g., interleaved image, text, and video) through a one-model-for-all autoregressive training process. First, visual signals are encoded into embeddings, which together with text tokens form an interleaved input sequence. Emu is then trained end-to-end with a unified objective of classifying the next text token or regressing the next visual embedding in the multimodal sequence. This versatile multimodality enables the exploration of diverse pretraining data sources at scale, such as videos with interleaved frames and text, webpages with interleaved images and text, and web-scale image-text and video-text pairs. Emu can serve as a generalist multimodal interface for both image-to-text and text-to-image tasks, and it supports in-context image and text generation. Across a broad range of zero-shot and few-shot tasks, including image captioning, visual question answering, video question answering, and text-to-image generation, Emu demonstrates superb performance compared to state-of-the-art large multimodal models. Extended capabilities, such as multimodal assistants built via instruction tuning, are also demonstrated with impressive performance.
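
The unified objective in the abstract (classify the next text token, or regress the next visual embedding) can be illustrated with a minimal sketch. This is not the Emu implementation. The function names, the tagged-tuple sequence format, the use of mean squared error as the regression loss, and the equal weighting of the two terms are all assumptions made for illustration, since the abstract does not specify them.

```python
import math

def cross_entropy(logits, target_id):
    """Negative log-likelihood of target_id under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_id]

def squared_error(pred, target):
    """Mean squared error between two equal-length vectors (assumed regression loss)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

def unified_loss(predictions, targets):
    """Average per-position loss over an interleaved multimodal sequence.

    targets: list of ("text", token_id) or ("visual", embedding) items
             (hypothetical format chosen for this sketch).
    predictions: vocabulary logits at text positions,
                 a predicted embedding vector at visual positions.
    """
    total = 0.0
    for pred, (kind, value) in zip(predictions, targets):
        if kind == "text":
            total += cross_entropy(pred, value)   # classify the next text token
        elif kind == "visual":
            total += squared_error(pred, value)   # regress the next visual embedding
        else:
            raise ValueError(f"unknown target kind: {kind!r}")
    return total / len(targets)  # equal weighting of both terms is an assumption

# Toy interleaved sequence: text token, visual embedding, text token.
targets = [("text", 2), ("visual", [0.5, -1.0]), ("text", 0)]
predictions = [[0.0, 0.0, 0.0], [0.5, 0.0], [0.0, 0.0, 0.0]]
loss = unified_loss(predictions, targets)
print(loss)
```

In this sketch a single loop handles both modalities, which mirrors the idea of one autoregressive objective over an interleaved sequence. A real model would compute both terms over batched tensors and might weight them differently.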
