mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration
November 7, 2023
Authors: Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Haowei Liu, Qi Qian, Ji Zhang, Fei Huang, Jingren Zhou, Anwen Hu
cs.AI
Abstract
Multi-modal Large Language Models (MLLMs) have demonstrated impressive
instruction-following abilities across various open-ended tasks. However, previous
methods primarily focus on enhancing multi-modal capabilities. In this work, we
introduce a versatile multi-modal large language model, mPLUG-Owl2, which
effectively leverages modality collaboration to improve performance in both
text and multi-modal tasks. mPLUG-Owl2 utilizes a modularized network design,
with the language decoder acting as a universal interface for managing
different modalities. Specifically, mPLUG-Owl2 incorporates shared functional
modules to facilitate modality collaboration and introduces a modality-adaptive
module that preserves modality-specific features. Extensive experiments reveal
that mPLUG-Owl2 can generalize to both text and multi-modal tasks while
achieving state-of-the-art performance with a single generic model. Notably,
mPLUG-Owl2 is the first MLLM to demonstrate the modality collaboration
phenomenon in both pure-text and multi-modal scenarios, setting a pioneering
path for the development of future multi-modal foundation models.
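The core design described above, shared functional modules combined with a modality-adaptive module, can be illustrated with a minimal sketch. This is a hypothetical simplification, not the paper's implementation: it assumes a single attention-like layer where the query projection is shared across modalities while layer normalization and key/value projections are kept separate per modality, so text and image tokens retain modality-specific statistics yet interact through one joint attention map. All class and parameter names here are invented for illustration.

```python
import numpy as np

class ModalityAdaptiveLayer:
    """Hypothetical sketch of a modality-adaptive module: shared query
    projection (modality collaboration) with per-modality normalization
    and key/value projections (modality-specific features)."""

    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        # Shared parameter: one query projection used by all modalities.
        self.w_q = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        # Modality-specific parameters: separate key/value projections
        # for "text" and "image" tokens.
        self.w_kv = {m: rng.standard_normal((dim, 2 * dim)) / np.sqrt(dim)
                     for m in ("text", "image")}
        self.dim = dim

    def _norm(self, x):
        # Simple layer norm (no learned scale/shift, for brevity).
        mu = x.mean(-1, keepdims=True)
        sd = x.std(-1, keepdims=True) + 1e-6
        return (x - mu) / sd

    def __call__(self, x, modality_ids):
        # x: (seq, dim); modality_ids: one of "text"/"image" per token.
        q = self._norm(x) @ self.w_q  # shared path
        # Each token is projected with its own modality's KV weights.
        kv = np.stack([self._norm(x[i]) @ self.w_kv[m]
                       for i, m in enumerate(modality_ids)])
        k, v = kv[:, :self.dim], kv[:, self.dim:]
        # Joint softmax attention: text and image tokens attend to each
        # other, enabling cross-modal collaboration.
        scores = np.exp(q @ k.T / np.sqrt(self.dim))
        attn = scores / scores.sum(-1, keepdims=True)
        return attn @ v

layer = ModalityAdaptiveLayer(dim=8)
tokens = np.random.default_rng(1).standard_normal((4, 8))
out = layer(tokens, ["text", "text", "image", "image"])
```

The design choice sketched here mirrors the abstract's claim: the shared path lets the language decoder act as a universal interface over all modalities, while the per-modality branches prevent visual and textual features from being forced through identical statistics.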