mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration
November 7, 2023
Authors: Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Haowei Liu, Qi Qian, Ji Zhang, Fei Huang, Jingren Zhou, Anwen Hu
cs.AI
Abstract
Multi-modal Large Language Models (MLLMs) have demonstrated impressive
instruction-following abilities across various open-ended tasks. However, previous
methods primarily focus on enhancing multi-modal capabilities. In this work, we
introduce a versatile multi-modal large language model, mPLUG-Owl2, which
effectively leverages modality collaboration to improve performance in both
text and multi-modal tasks. mPLUG-Owl2 utilizes a modularized network design,
with the language decoder acting as a universal interface for managing
different modalities. Specifically, mPLUG-Owl2 incorporates shared functional
modules to facilitate modality collaboration and introduces a modality-adaptive
module that preserves modality-specific features. Extensive experiments reveal
that mPLUG-Owl2 can generalize to both text and multi-modal tasks while
achieving state-of-the-art performance with a single generic model. Notably,
mPLUG-Owl2 is the first MLLM to demonstrate the modality collaboration
phenomenon in both pure-text and multi-modal scenarios, setting a pioneering
path for the development of future multi-modal foundation models.
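The core design described above, shared functional modules combined with a modality-adaptive module, can be illustrated with a minimal sketch. This is a hypothetical simplification, not the paper's implementation: it assumes a single attention-like layer where the query projection is shared across modalities while layer normalization and key/value projections are kept separate per modality, so text and image tokens retain modality-specific statistics yet interact through one joint attention map. All class and parameter names here are invented for illustration.

```python
import numpy as np

class ModalityAdaptiveLayer:
    """Hypothetical sketch of a modality-adaptive module: shared query
    projection (modality collaboration) with per-modality normalization
    and key/value projections (modality-specific features)."""

    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        # Shared parameter: one query projection used by all modalities.
        self.w_q = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        # Modality-specific parameters: separate key/value projections
        # for "text" and "image" tokens.
        self.w_kv = {m: rng.standard_normal((dim, 2 * dim)) / np.sqrt(dim)
                     for m in ("text", "image")}
        self.dim = dim

    def _norm(self, x):
        # Simple layer norm (no learned scale/shift, for brevity).
        mu = x.mean(-1, keepdims=True)
        sd = x.std(-1, keepdims=True) + 1e-6
        return (x - mu) / sd

    def __call__(self, x, modality_ids):
        # x: (seq, dim); modality_ids: one of "text"/"image" per token.
        q = self._norm(x) @ self.w_q  # shared path
        # Each token is projected with its own modality's KV weights.
        kv = np.stack([self._norm(x[i]) @ self.w_kv[m]
                       for i, m in enumerate(modality_ids)])
        k, v = kv[:, :self.dim], kv[:, self.dim:]
        # Joint softmax attention: text and image tokens attend to each
        # other, enabling cross-modal collaboration.
        scores = np.exp(q @ k.T / np.sqrt(self.dim))
        attn = scores / scores.sum(-1, keepdims=True)
        return attn @ v

layer = ModalityAdaptiveLayer(dim=8)
tokens = np.random.default_rng(1).standard_normal((4, 8))
out = layer(tokens, ["text", "text", "image", "image"])
```

The design choice sketched here mirrors the abstract's claim: the shared path lets the language decoder act as a universal interface over all modalities, while the per-modality branches prevent visual and textual features from being forced through identical statistics.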