Honeybee: Locality-enhanced Projector for Multimodal LLM
December 11, 2023
Authors: Junbum Cha, Wooyoung Kang, Jonghwan Mun, Byungseok Roh
cs.AI
Abstract
In Multimodal Large Language Models (MLLMs), a visual projector plays a
crucial role in bridging pre-trained vision encoders with LLMs, enabling
profound visual understanding while harnessing the LLMs' robust capabilities.
Despite the importance of the visual projector, it has been relatively less
explored. In this study, we first identify two essential projector properties:
(i) flexibility in managing the number of visual tokens, crucial for MLLMs'
overall efficiency, and (ii) preservation of local context from visual
features, vital for spatial understanding. Based on these findings, we propose
a novel projector design that is both flexible and locality-enhanced,
effectively satisfying the two desirable properties. Additionally, we present
comprehensive strategies to effectively utilize multiple and multifaceted
instruction datasets. Through extensive experiments, we examine the impact of
individual design choices. Finally, our proposed MLLM, Honeybee, remarkably
outperforms previous state-of-the-art methods across various benchmarks,
including MME, MMBench, SEED-Bench, and LLaVA-Bench, achieving significantly
higher efficiency. Code and models are available at
https://github.com/kakaobrain/honeybee.
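
To make the two projector properties concrete, below is a minimal PyTorch sketch of a flexible, locality-preserving visual projector: convolutions mix neighboring patch features so local spatial context is retained, and adaptive pooling lets the output visual-token count be chosen independently of the input patch grid. All module names, dimensions, and hyperparameters are illustrative assumptions, not the paper's actual architecture.

```python
# Illustrative sketch only: a projector that (i) flexibly controls the number
# of visual tokens and (ii) preserves local context via convolutional mixing.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocalityEnhancedProjector(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=4096, num_tokens=144):
        super().__init__()
        self.num_tokens = num_tokens  # e.g., a 12x12 grid of visual tokens
        # Local mixing: each output feature depends on its spatial neighborhood.
        self.local_mix = nn.Sequential(
            nn.Conv2d(vision_dim, vision_dim, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv2d(vision_dim, vision_dim, kernel_size=3, padding=1),
        )
        self.proj = nn.Linear(vision_dim, llm_dim)  # map into the LLM embedding space

    def forward(self, vision_feats):
        # vision_feats: (B, N, C) patch features from a frozen vision encoder
        b, n, c = vision_feats.shape
        h = w = int(n ** 0.5)                        # assume a square patch grid
        x = vision_feats.transpose(1, 2).reshape(b, c, h, w)
        x = self.local_mix(x)                        # locality-preserving abstraction
        s = int(self.num_tokens ** 0.5)
        x = F.adaptive_avg_pool2d(x, s)              # flexible token count: s*s tokens
        x = x.flatten(2).transpose(1, 2)             # (B, num_tokens, C)
        return self.proj(x)                          # (B, num_tokens, llm_dim)


# Usage: 576 CLIP-like patch features -> 144 LLM-ready visual tokens
feats = torch.randn(2, 576, 1024)
tokens = LocalityEnhancedProjector()(feats)
print(tokens.shape)  # torch.Size([2, 144, 4096])
```

Compared with a resampler that attends globally over all patches, or a plain linear projection that fixes the token count to the patch count, a design in this spirit keeps neighboring-patch information intact while still letting the token budget be tuned for efficiency.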