

Generative Multimodal Models are In-Context Learners

December 20, 2023
Authors: Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Zhengxiong Luo, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, Xinlong Wang
cs.AI

Abstract

The human ability to easily solve multimodal tasks in context (i.e., with only a few demonstrations or simple instructions) is what current multimodal systems have largely struggled to imitate. In this work, we demonstrate that the task-agnostic in-context learning capabilities of large multimodal models can be significantly enhanced by effective scaling-up. We introduce Emu2, a generative multimodal model with 37 billion parameters, trained on large-scale multimodal sequences with a unified autoregressive objective. Emu2 exhibits strong multimodal in-context learning abilities, even emerging to solve tasks that require on-the-fly reasoning, such as visual prompting and object-grounded generation. The model sets a new record on multiple multimodal understanding tasks in few-shot settings. When instruction-tuned to follow specific instructions, Emu2 further achieves new state-of-the-art on challenging tasks such as question answering benchmarks for large multimodal models and open-ended subject-driven generation. These achievements demonstrate that Emu2 can serve as a base model and general-purpose interface for a wide range of multimodal tasks. Code and models are publicly available to facilitate future research.