ChatPaper.ai


Generative Multimodal Models are In-Context Learners

December 20, 2023
Authors: Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Zhengxiong Luo, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, Xinlong Wang
cs.AI

Abstract

The human ability to easily solve multimodal tasks in context (i.e., with only a few demonstrations or simple instructions), is what current multimodal systems have largely struggled to imitate. In this work, we demonstrate that the task-agnostic in-context learning capabilities of large multimodal models can be significantly enhanced by effective scaling-up. We introduce Emu2, a generative multimodal model with 37 billion parameters, trained on large-scale multimodal sequences with a unified autoregressive objective. Emu2 exhibits strong multimodal in-context learning abilities, even emerging to solve tasks that require on-the-fly reasoning, such as visual prompting and object-grounded generation. The model sets a new record on multiple multimodal understanding tasks in few-shot settings. When instruction-tuned to follow specific instructions, Emu2 further achieves new state-of-the-art on challenging tasks such as question answering benchmarks for large multimodal models and open-ended subject-driven generation. These achievements demonstrate that Emu2 can serve as a base model and general-purpose interface for a wide range of multimodal tasks. Code and models are publicly available to facilitate future research.
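The abstract's central claim is that Emu2 learns new tasks in context from interleaved multimodal demonstrations. A minimal sketch of what such a few-shot prompt might look like, assuming a model that consumes an interleaved sequence of image embeddings and text; all names (`Image`, `few_shot_prompt`) are illustrative placeholders, not the actual Emu2 API:

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

@dataclass(frozen=True)
class Image:
    """Placeholder for a real image embedding; identified here by path only."""
    path: str

Token = Union[str, Image]

def few_shot_prompt(demos: List[Tuple[Image, str]], query: Image) -> List[Token]:
    """Interleave (image, label) demonstrations, then append the query image.

    The model is expected to continue the pattern and label the query,
    with no task-specific fine-tuning -- this is the in-context setting.
    """
    prompt: List[Token] = []
    for img, label in demos:
        prompt += [img, "This is", label, "."]
    prompt += [query, "This is"]  # model completes the final label
    return prompt

demos = [(Image("cat.jpg"), "a cat"), (Image("dog.jpg"), "a dog")]
prompt = few_shot_prompt(demos, Image("query.jpg"))
```

Scaling up model and data is what the paper argues makes continuations of such prompts reliable, rather than any task-specific head.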
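The abstract also states that Emu2 is trained with a "unified autoregressive objective" over multimodal sequences, i.e. text and image tokens are scored uniformly by next-token prediction. A toy sketch of that objective under stated assumptions (a lookup table stands in for the 37B-parameter transformer; all names are hypothetical):

```python
import math

def autoregressive_nll(predict, sequence):
    """Average negative log-likelihood of a sequence under a model.

    predict: function prefix -> dict mapping candidate next token to probability.
    sequence: interleaved tokens; text tokens (str) and image tokens
              (here tuples like ("img", patch_id)) are scored identically,
              which is what makes the objective "unified".
    """
    total = 0.0
    for i in range(1, len(sequence)):
        prefix, target = sequence[:i], sequence[i]
        p = predict(prefix).get(target, 1e-12)  # floor to avoid log(0)
        total += -math.log(p)
    return total / (len(sequence) - 1)

def toy_model(prefix):
    # Context-free stand-in distribution; a real model conditions on prefix.
    return {"cat": 0.5, ("img", 0): 0.25, "sat": 0.25}

seq = ["a", "cat", ("img", 0), "sat"]
loss = autoregressive_nll(toy_model, seq)
```

Minimizing this single loss over large-scale interleaved data is, per the abstract, the only training signal; the in-context abilities emerge from scale rather than from auxiliary objectives.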