Dense Connector for MLLMs
May 22, 2024
Authors: Huanjin Yao, Wenhao Wu, Taojiannan Yang, YuXin Song, Mengxi Zhang, Haocheng Feng, Yifan Sun, Zhiheng Li, Wanli Ouyang, Jingdong Wang
cs.AI
Abstract
Do we fully leverage the potential of the visual encoder in Multimodal Large
Language Models (MLLMs)? The recent outstanding performance of MLLMs in
multimodal understanding has garnered broad attention from both academia and
industry. In the current MLLM rat race, the focus seems to be predominantly on
the linguistic side. We witness the rise of larger and higher-quality
instruction datasets, as well as the involvement of larger-sized LLMs. Yet,
scant attention has been directed towards the visual signals utilized by MLLMs,
often assumed to be the final high-level features extracted by a frozen visual
encoder. In this paper, we introduce the Dense Connector - a simple, effective,
and plug-and-play vision-language connector that significantly enhances
existing MLLMs by leveraging multi-layer visual features, with minimal
additional computational overhead. Furthermore, our model, trained solely on
images, showcases remarkable zero-shot capabilities in video understanding as
well. Experimental results across various vision encoders, image resolutions,
training dataset scales, varying sizes of LLMs (2.7B->70B), and diverse
architectures of MLLMs (e.g., LLaVA and Mini-Gemini) validate the versatility
and scalability of our approach, achieving state-of-the-art performance
across 19 image and video benchmarks. We hope that this work will provide
valuable experience and serve as a basic module for future MLLM development.
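The abstract describes the Dense Connector only at a high level, so the sketch below is just one rough way to picture "leveraging multi-layer visual features" in a plug-and-play connector: hidden states from a few vision-encoder layers are concatenated along the channel dimension and projected into the LLM embedding space with a small MLP. The class name, chosen layer indices, and dimensions here are hypothetical placeholders, not the paper's actual configuration.

```python
import torch
import torch.nn as nn


class DenseConnectorSketch(nn.Module):
    """Illustrative multi-layer vision-language connector (not the paper's exact design).

    Assumption: features from several layers of a frozen vision encoder are fused by
    channel-wise concatenation and mapped to the LLM hidden size with a two-layer MLP.
    """

    def __init__(self, vision_dim=1024, llm_dim=4096, num_layers=3):
        super().__init__()
        self.projector = nn.Sequential(
            nn.Linear(vision_dim * num_layers, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, hidden_states, layer_ids=(12, 18, 24)):
        # hidden_states: list of [batch, num_patches, vision_dim] tensors, one per
        # vision-encoder layer (e.g. a CLIP ViT run with output_hidden_states=True).
        # layer_ids picks which layers to fuse; the indices here are placeholders.
        selected = [hidden_states[i] for i in layer_ids]
        fused = torch.cat(selected, dim=-1)   # [B, N, vision_dim * num_layers]
        return self.projector(fused)          # [B, N, llm_dim] visual tokens for the LLM


if __name__ == "__main__":
    # Toy usage with random tensors standing in for the encoder's hidden states.
    dummy_states = [torch.randn(2, 576, 1024) for _ in range(25)]
    connector = DenseConnectorSketch()
    visual_tokens = connector(dummy_states)
    print(visual_tokens.shape)  # torch.Size([2, 576, 4096])
```

Because the fusion is a concatenation plus a lightweight projector, such a connector adds little compute on top of the vision encoder and LLM, which is consistent with the "minimal additional computational overhead" claim in the abstract.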