mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models

August 9, 2024
Authors: Jiabo Ye, Haiyang Xu, Haowei Liu, Anwen Hu, Ming Yan, Qi Qian, Ji Zhang, Fei Huang, Jingren Zhou
cs.AI

Abstract

Multi-modal Large Language Models (MLLMs) have demonstrated remarkable capabilities in executing instructions for a variety of single-image tasks. Despite this progress, significant challenges remain in modeling long image sequences. In this work, we introduce the versatile multi-modal large language model, mPLUG-Owl3, which enhances the capability for long image-sequence understanding in scenarios that incorporate retrieved image-text knowledge, interleaved image-text, and lengthy videos. Specifically, we propose novel hyper attention blocks to efficiently integrate vision and language into a common language-guided semantic space, thereby facilitating the processing of extended multi-image scenarios. Extensive experimental results suggest that mPLUG-Owl3 achieves state-of-the-art performance among models with a similar size on single-image, multi-image, and video benchmarks. Moreover, we propose a challenging long visual sequence evaluation named Distractor Resistance to assess the ability of models to maintain focus amidst distractions. Finally, with the proposed architecture, mPLUG-Owl3 demonstrates outstanding performance on ultra-long visual sequence inputs. We hope that mPLUG-Owl3 can contribute to the development of more efficient and powerful multimodal large language models.
