Dense Connector for MLLMs
May 22, 2024
Authors: Huanjin Yao, Wenhao Wu, Taojiannan Yang, YuXin Song, Mengxi Zhang, Haocheng Feng, Yifan Sun, Zhiheng Li, Wanli Ouyang, Jingdong Wang
cs.AI
Abstract
Do we fully leverage the potential of the visual encoder in Multimodal Large
Language Models (MLLMs)? The recent outstanding performance of MLLMs in
multimodal understanding has garnered broad attention from both academia and
industry. In the current MLLM rat race, the focus seems to be predominantly on
the linguistic side. We witness the rise of larger and higher-quality
instruction datasets, as well as the involvement of larger-sized LLMs. Yet,
scant attention has been directed towards the visual signals utilized by MLLMs,
often assumed to be the final high-level features extracted by a frozen visual
encoder. In this paper, we introduce the Dense Connector - a simple, effective,
and plug-and-play vision-language connector that significantly enhances
existing MLLMs by leveraging multi-layer visual features, with minimal
additional computational overhead. Furthermore, our model, trained solely on
images, showcases remarkable zero-shot capabilities in video understanding as
well. Experimental results across various vision encoders, image resolutions,
training dataset scales, varying sizes of LLMs (2.7B->70B), and diverse
architectures of MLLMs (e.g., LLaVA and Mini-Gemini) validate the versatility
and scalability of our approach, achieving state-of-the-art performance
across 19 image and video benchmarks. We hope that this work will provide
valuable experience and serve as a basic module for future MLLM development.
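The abstract describes the Dense Connector only at a high level, so the sketch below is just one rough way to picture "leveraging multi-layer visual features" in a plug-and-play connector: hidden states from a few vision-encoder layers are concatenated along the channel dimension and projected into the LLM embedding space with a small MLP. The class name, chosen layer indices, and dimensions here are hypothetical placeholders, not the paper's actual configuration.

```python
import torch
import torch.nn as nn


class DenseConnectorSketch(nn.Module):
    """Illustrative multi-layer vision-language connector (not the paper's exact design).

    Assumption: features from several layers of a frozen vision encoder are fused by
    channel-wise concatenation and mapped to the LLM hidden size with a two-layer MLP.
    """

    def __init__(self, vision_dim=1024, llm_dim=4096, num_layers=3):
        super().__init__()
        self.projector = nn.Sequential(
            nn.Linear(vision_dim * num_layers, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, hidden_states, layer_ids=(12, 18, 24)):
        # hidden_states: list of [batch, num_patches, vision_dim] tensors, one per
        # vision-encoder layer (e.g. a CLIP ViT run with output_hidden_states=True).
        # layer_ids picks which layers to fuse; the indices here are placeholders.
        selected = [hidden_states[i] for i in layer_ids]
        fused = torch.cat(selected, dim=-1)   # [B, N, vision_dim * num_layers]
        return self.projector(fused)          # [B, N, llm_dim] visual tokens for the LLM


if __name__ == "__main__":
    # Toy usage with random tensors standing in for the encoder's hidden states.
    dummy_states = [torch.randn(2, 576, 1024) for _ in range(25)]
    connector = DenseConnectorSketch()
    visual_tokens = connector(dummy_states)
    print(visual_tokens.shape)  # torch.Size([2, 576, 4096])
```

Because the fusion is a concatenation plus a lightweight projector, such a connector adds little compute on top of the vision encoder and LLM, which is consistent with the "minimal additional computational overhead" claim in the abstract.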