
Dense Connector for MLLMs

May 22, 2024
作者: Huanjin Yao, Wenhao Wu, Taojiannan Yang, YuXin Song, Mengxi Zhang, Haocheng Feng, Yifan Sun, Zhiheng Li, Wanli Ouyang, Jingdong Wang
cs.AI

Abstract

Do we fully leverage the potential of the visual encoder in Multimodal Large Language Models (MLLMs)? The recent outstanding performance of MLLMs in multimodal understanding has garnered broad attention from both academia and industry. In the current MLLM rat race, the focus seems to be predominantly on the linguistic side. We witness the rise of larger and higher-quality instruction datasets, as well as the involvement of larger-sized LLMs. Yet, scant attention has been directed towards the visual signals utilized by MLLMs, often assumed to be the final high-level features extracted by a frozen visual encoder. In this paper, we introduce the Dense Connector - a simple, effective, and plug-and-play vision-language connector that significantly enhances existing MLLMs by leveraging multi-layer visual features, with minimal additional computational overhead. Furthermore, our model, trained solely on images, showcases remarkable zero-shot capabilities in video understanding as well. Experimental results across various vision encoders, image resolutions, training dataset scales, varying sizes of LLMs (2.7B->70B), and diverse architectures of MLLMs (e.g., LLaVA and Mini-Gemini) validate the versatility and scalability of our approach, achieving state-of-the-art performance across 19 image and video benchmarks. We hope that this work will provide valuable experience and serve as a basic module for future MLLM development.
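The abstract describes the Dense Connector only at a high level: instead of feeding the LLM the final-layer output of a frozen visual encoder, it fuses visual features from multiple encoder layers through a lightweight connector. The sketch below illustrates one plausible reading of that idea in NumPy; the choice of tapped layers, channel-wise concatenation, and a single projection matrix are my own simplifying assumptions, not the paper's exact design.

```python
import numpy as np

def dense_connector(layer_features, proj):
    """Hypothetical sketch of multi-layer visual feature fusion.

    layer_features: list of (num_tokens, dim) arrays, one per tapped
                    visual-encoder layer (which layers to tap is an
                    assumption here, not specified by the abstract).
    proj: (k * dim, llm_dim) matrix standing in for the learned
          vision-language projection.
    """
    # Concatenate the per-layer features along the channel dimension,
    # then project the fused tokens into the LLM embedding space.
    fused = np.concatenate(layer_features, axis=-1)   # (num_tokens, k * dim)
    return fused @ proj                               # (num_tokens, llm_dim)

# Toy example: 3 tapped layers, 4 visual tokens, encoder dim 8, LLM dim 16.
rng = np.random.default_rng(0)
feats = [rng.standard_normal((4, 8)) for _ in range(3)]
W = rng.standard_normal((3 * 8, 16))
tokens = dense_connector(feats, W)
print(tokens.shape)  # (4, 16)
```

Because the fusion is a concatenation plus one projection, the extra cost over a standard single-layer connector is small, which is consistent with the abstract's "minimal additional computational overhead" claim.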
