Generative Multimodal Models are In-Context Learners

Abstract

Humans can easily solve multimodal tasks in context, that is, with only a few demonstrations or simple instructions, an ability that current multimodal systems have largely struggled to imitate. In this work, we demonstrate that the task-agnostic in-context learning capabilities of large multimodal models can be significantly enhanced by effective scaling-up. We introduce Emu2, a generative multimodal model with 37 billion parameters, trained on large-scale multimodal sequences with a unified autoregressive objective. Emu2 exhibits strong multimodal in-context learning abilities, extending even to tasks that require on-the-fly reasoning, such as visual prompting and object-grounded generation. The model sets a new record on multiple multimodal understanding tasks in few-shot settings. When instruction-tuned to follow specific instructions, Emu2 further achieves a new state of the art on challenging tasks such as question answering benchmarks for large multimodal models and open-ended subject-driven generation. These achievements demonstrate that Emu2 can serve as a base model and general-purpose interface for a wide range of multimodal tasks. Code and models are publicly available to facilitate future research.
