mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration

Abstract

Multi-modal Large Language Models (MLLMs) have demonstrated impressive instruction abilities across a variety of open-ended tasks. However, previous methods have focused primarily on enhancing multi-modal capabilities. In this work, we introduce mPLUG-Owl2, a versatile multi-modal large language model that leverages modality collaboration to improve performance on both text and multi-modal tasks. mPLUG-Owl2 uses a modularized network design in which the language decoder acts as a universal interface for managing different modalities. Specifically, mPLUG-Owl2 incorporates shared functional modules to facilitate modality collaboration and introduces a modality-adaptive module that preserves modality-specific features. Extensive experiments show that mPLUG-Owl2 generalizes to both text and multi-modal tasks and achieves state-of-the-art performance with a single generic model. Notably, mPLUG-Owl2 is the first MLLM to demonstrate the modality collaboration phenomenon in both pure-text and multi-modal scenarios, setting a pioneering path for the development of future multi-modal foundation models.
