
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration

November 7, 2023
Authors: Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Haowei Liu, Qi Qian, Ji Zhang, Fei Huang, Jingren Zhou, Anwen Hu
cs.AI

Abstract

Multi-modal Large Language Models (MLLMs) have demonstrated impressive instruction-following abilities across various open-ended tasks. However, previous methods have focused primarily on enhancing multi-modal capabilities. In this work, we introduce a versatile multi-modal large language model, mPLUG-Owl2, which effectively leverages modality collaboration to improve performance on both text and multi-modal tasks. mPLUG-Owl2 uses a modularized network design, with the language decoder acting as a universal interface for managing different modalities. Specifically, mPLUG-Owl2 incorporates shared functional modules to facilitate modality collaboration and introduces a modality-adaptive module that preserves modality-specific features. Extensive experiments show that mPLUG-Owl2 generalizes to both text and multi-modal tasks and achieves state-of-the-art performance with a single generic model. Notably, mPLUG-Owl2 is the first MLLM to demonstrate the modality collaboration phenomenon in both pure-text and multi-modal scenarios, setting a pioneering path for the development of future multi-modal foundation models.
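The modality-adaptive module described in the abstract can be pictured as a decoder attention block in which some projections are shared across modalities while normalization and the key/value projections are kept modality-specific, so visual and text tokens attend to each other jointly without collapsing into a single feature space. The sketch below is a minimal, hypothetical PyTorch illustration of that idea only; the class name, parameter names, and the exact split between shared and per-modality components are our assumptions, not the released mPLUG-Owl2 implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalityAdaptiveAttention(nn.Module):
    """Sketch: self-attention where the query/output projections are shared
    across modalities, while layer norms and key/value projections are
    duplicated per modality to preserve modality-specific features."""

    def __init__(self, dim: int, num_heads: int, num_modalities: int = 2):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # Shared components: query projection and output projection.
        self.q_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)
        # Modality-specific components: one LayerNorm and K/V projection per modality.
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(num_modalities))
        self.k_projs = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_modalities))
        self.v_projs = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_modalities))

    def forward(self, x: torch.Tensor, modality_ids: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); modality_ids: (batch, seq) with values 0..num_modalities-1
        b, n, d = x.shape
        normed = torch.zeros_like(x)
        k = torch.zeros_like(x)
        v = torch.zeros_like(x)
        for m, (norm, k_proj, v_proj) in enumerate(
            zip(self.norms, self.k_projs, self.v_projs)
        ):
            mask = (modality_ids == m).unsqueeze(-1)  # (b, n, 1)
            xm = norm(x)                              # modality-specific normalization
            normed = torch.where(mask, xm, normed)
            k = torch.where(mask, k_proj(xm), k)
            v = torch.where(mask, v_proj(xm), v)
        q = self.q_proj(normed)                       # shared query projection
        # Attention runs jointly over the mixed visual/text sequence.
        q, k, v = (
            t.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
            for t in (q, k, v)
        )
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        attn = attn.transpose(1, 2).reshape(b, n, d)
        return x + self.out_proj(attn)                # residual connection
```

In use, a sequence would interleave projected visual tokens (e.g. modality id 1) with text tokens (id 0), and the language decoder would stack blocks like this one: attention is computed over all tokens together, which supports the modality collaboration the paper reports, while the per-modality norms and K/V projections keep the representations from being forced into a single shared statistic.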