

Honeybee: Locality-enhanced Projector for Multimodal LLM

December 11, 2023
Authors: Junbum Cha, Wooyoung Kang, Jonghwan Mun, Byungseok Roh
cs.AI

Abstract

In Multimodal Large Language Models (MLLMs), a visual projector plays a crucial role in bridging pre-trained vision encoders with LLMs, enabling profound visual understanding while harnessing the LLMs' robust capabilities. Despite the importance of the visual projector, it has been relatively less explored. In this study, we first identify two essential projector properties: (i) flexibility in managing the number of visual tokens, crucial for MLLMs' overall efficiency, and (ii) preservation of local context from visual features, vital for spatial understanding. Based on these findings, we propose a novel projector design that is both flexible and locality-enhanced, effectively satisfying the two desirable properties. Additionally, we present comprehensive strategies to effectively utilize multiple and multifaceted instruction datasets. Through extensive experiments, we examine the impact of individual design choices. Finally, our proposed MLLM, Honeybee, remarkably outperforms previous state-of-the-art methods across various benchmarks, including MME, MMBench, SEED-Bench, and LLaVA-Bench, achieving significantly higher efficiency. Code and models are available at https://github.com/kakaobrain/honeybee.
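
The abstract's two desired projector properties can be made concrete with a small sketch: local mixing (e.g., convolutions over the patch grid) preserves spatial context, while pooling to a configurable grid controls how many visual tokens reach the LLM. The sketch below is illustrative only and is not the authors' implementation; the class name LocalityEnhancedProjector and the num_visual_tokens parameter are assumptions made for this example (see https://github.com/kakaobrain/honeybee for the actual code).

```python
# Illustrative sketch of a "locality-enhanced" projector, NOT the Honeybee code:
# convolutions keep neighboring patch features adjacent (locality), and adaptive
# pooling makes the number of output visual tokens a tunable knob (flexibility).
import math
import torch
import torch.nn as nn


class LocalityEnhancedProjector(nn.Module):  # hypothetical name for illustration
    def __init__(self, vision_dim: int, llm_dim: int, num_visual_tokens: int):
        super().__init__()
        # Local mixing: 3x3 convolutions operate on the 2D patch grid, unlike a
        # per-token MLP or a purely global cross-attention resampler.
        self.local_mixer = nn.Sequential(
            nn.Conv2d(vision_dim, vision_dim, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv2d(vision_dim, vision_dim, kernel_size=3, padding=1),
            nn.GELU(),
        )
        # Flexibility: adaptive average pooling compresses to any square grid,
        # so the visual-token budget fed to the LLM is configurable.
        side = int(math.isqrt(num_visual_tokens))
        assert side * side == num_visual_tokens, "expects a square token count"
        self.pool = nn.AdaptiveAvgPool2d(side)
        self.to_llm = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim) from a ViT encoder.
        b, n, c = patch_features.shape
        h = w = int(math.isqrt(n))
        x = patch_features.transpose(1, 2).reshape(b, c, h, w)
        x = self.local_mixer(x)
        x = self.pool(x)                      # (b, c, side, side)
        x = x.flatten(2).transpose(1, 2)      # (b, num_visual_tokens, c)
        return self.to_llm(x)                 # (b, num_visual_tokens, llm_dim)


# Usage: compress 24x24 = 576 ViT patch features into 144 LLM-ready visual tokens.
projector = LocalityEnhancedProjector(vision_dim=1024, llm_dim=4096,
                                      num_visual_tokens=144)
tokens = projector(torch.randn(2, 576, 1024))  # -> shape (2, 144, 4096)
```

The design choice this sketch highlights is the trade-off the paper identifies: a linear/MLP projector keeps all tokens (locality but no compression), an abstractor with learned queries compresses tokens but can lose local structure, whereas a convolution-plus-pooling projector aims to offer both.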