
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration

November 7, 2023
Authors: Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Haowei Liu, Qi Qian, Ji Zhang, Fei Huang, Jingren Zhou, Anwen Hu
cs.AI

Abstract

Multi-modal Large Language Models (MLLMs) have demonstrated impressive instruction-following abilities across various open-ended tasks. However, previous methods have focused primarily on enhancing multi-modal capabilities. In this work, we introduce a versatile multi-modal large language model, mPLUG-Owl2, which effectively leverages modality collaboration to improve performance on both text and multi-modal tasks. mPLUG-Owl2 uses a modularized network design, with the language decoder acting as a universal interface for managing different modalities. Specifically, mPLUG-Owl2 incorporates shared functional modules to facilitate modality collaboration and introduces a modality-adaptive module that preserves modality-specific features. Extensive experiments show that mPLUG-Owl2 generalizes to both text and multi-modal tasks and achieves state-of-the-art performance with a single generic model. Notably, mPLUG-Owl2 is the first MLLM to demonstrate the modality collaboration phenomenon in both pure-text and multi-modal scenarios, setting a pioneering path for the development of future multi-modal foundation models.
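The modality-adaptive module described in the abstract can be pictured as a decoder attention block in which some projections are shared across modalities while normalization and the key/value projections are kept modality-specific, so visual and text tokens attend to each other jointly without collapsing into a single feature space. The sketch below is a minimal, hypothetical PyTorch illustration of that idea only; the class name, parameter names, and the exact split between shared and per-modality components are our assumptions, not the released mPLUG-Owl2 implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalityAdaptiveAttention(nn.Module):
    """Sketch: self-attention where the query/output projections are shared
    across modalities, while layer norms and key/value projections are
    duplicated per modality to preserve modality-specific features."""

    def __init__(self, dim: int, num_heads: int, num_modalities: int = 2):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # Shared components: query projection and output projection.
        self.q_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)
        # Modality-specific components: one LayerNorm and K/V projection per modality.
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(num_modalities))
        self.k_projs = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_modalities))
        self.v_projs = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_modalities))

    def forward(self, x: torch.Tensor, modality_ids: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); modality_ids: (batch, seq) with values 0..num_modalities-1
        b, n, d = x.shape
        normed = torch.zeros_like(x)
        k = torch.zeros_like(x)
        v = torch.zeros_like(x)
        for m, (norm, k_proj, v_proj) in enumerate(
            zip(self.norms, self.k_projs, self.v_projs)
        ):
            mask = (modality_ids == m).unsqueeze(-1)  # (b, n, 1)
            xm = norm(x)                              # modality-specific normalization
            normed = torch.where(mask, xm, normed)
            k = torch.where(mask, k_proj(xm), k)
            v = torch.where(mask, v_proj(xm), v)
        q = self.q_proj(normed)                       # shared query projection
        # Attention runs jointly over the mixed visual/text sequence.
        q, k, v = (
            t.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
            for t in (q, k, v)
        )
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        attn = attn.transpose(1, 2).reshape(b, n, d)
        return x + self.out_proj(attn)                # residual connection
```

In use, a sequence would interleave projected visual tokens (e.g. modality id 1) with text tokens (id 0), and the language decoder would stack blocks like this one: attention is computed over all tokens together, which supports the modality collaboration the paper reports, while the per-modality norms and K/V projections keep the representations from being forced into a single shared statistic.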