mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models

Abstract

Multi-modal Large Language Models (MLLMs) have demonstrated remarkable capabilities in executing instructions for a variety of single-image tasks. Despite this progress, significant challenges remain in modeling long image sequences. In this work, we introduce the versatile multi-modal large language model, mPLUG-Owl3, which enhances the capability for long image-sequence understanding in scenarios that incorporate retrieved image-text knowledge, interleaved image-text, and lengthy videos. Specifically, we propose novel hyper attention blocks to efficiently integrate vision and language into a common language-guided semantic space, thereby facilitating the processing of extended multi-image scenarios. Extensive experimental results suggest that mPLUG-Owl3 achieves state-of-the-art performance among models with a similar size on single-image, multi-image, and video benchmarks. Moreover, we propose a challenging long visual sequence evaluation named Distractor Resistance to assess the ability of models to maintain focus amidst distractions. Finally, with the proposed architecture, mPLUG-Owl3 demonstrates outstanding performance on ultra-long visual sequence inputs. We hope that mPLUG-Owl3 can contribute to the development of more efficient and powerful multimodal large language models.