

mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models

August 9, 2024
作者: Jiabo Ye, Haiyang Xu, Haowei Liu, Anwen Hu, Ming Yan, Qi Qian, Ji Zhang, Fei Huang, Jingren Zhou
cs.AI

Abstract

Multi-modal Large Language Models (MLLMs) have demonstrated remarkable capabilities in executing instructions for a variety of single-image tasks. Despite this progress, significant challenges remain in modeling long image sequences. In this work, we introduce the versatile multi-modal large language model, mPLUG-Owl3, which enhances the capability for long image-sequence understanding in scenarios that incorporate retrieved image-text knowledge, interleaved image-text, and lengthy videos. Specifically, we propose novel hyper attention blocks to efficiently integrate vision and language into a common language-guided semantic space, thereby facilitating the processing of extended multi-image scenarios. Extensive experimental results suggest that mPLUG-Owl3 achieves state-of-the-art performance among models with a similar size on single-image, multi-image, and video benchmarks. Moreover, we propose a challenging long visual sequence evaluation named Distractor Resistance to assess the ability of models to maintain focus amidst distractions. Finally, with the proposed architecture, mPLUG-Owl3 demonstrates outstanding performance on ultra-long visual sequence inputs. We hope that mPLUG-Owl3 can contribute to the development of more efficient and powerful multimodal large language models.
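The abstract describes the hyper attention blocks only at a high level: language hidden states are fused with visual features in a language-guided semantic space. As a rough, toy-scale illustration of that idea (not the paper's implementation — the function names, the plain-Python tensors, and the fixed scalar `gate` standing in for a learned adaptive gate are all simplifications assumed here), one can sketch text self-attention combined with text-queried cross-attention over visual tokens:

```python
import math
import random

def softmax(xs):
    # numerically stable softmax over a list of scores
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(queries, keys, values, d):
    # scaled dot-product attention: each query row attends over keys/values
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, values)) for j in range(d)])
    return out

def hyper_attention_block(text, vision, gate=0.5):
    # Fuse text self-attention with language-guided cross-attention to vision.
    # 'gate' is a fixed stand-in for the learned gating the paper implies.
    d = len(text[0])
    self_out = attend(text, text, text, d)
    cross_out = attend(text, vision, vision, d)  # text queries, visual keys/values
    return [[s + gate * c for s, c in zip(srow, crow)]
            for srow, crow in zip(self_out, cross_out)]

random.seed(0)
text = [[random.gauss(0, 1) for _ in range(16)] for _ in range(8)]     # 8 text tokens, dim 16
vision = [[random.gauss(0, 1) for _ in range(16)] for _ in range(32)]  # 32 visual tokens
out = hyper_attention_block(text, vision)
print(len(out), len(out[0]))  # 8 16
```

Because the text tokens serve as the queries, the cross-attention output stays in the language hidden-state space regardless of how many visual tokens are interleaved, which is consistent with the abstract's claim of scaling to extended multi-image inputs.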

