Honeybee: Locality-enhanced Projector for Multimodal LLM

Abstract

In Multimodal Large Language Models (MLLMs), the visual projector plays a crucial role in bridging pre-trained vision encoders with LLMs, enabling deep visual understanding while harnessing the LLMs' capabilities. Despite its importance, the visual projector has been relatively underexplored. In this study, we first identify two essential projector properties: (i) flexibility in managing the number of visual tokens, which is crucial for the overall efficiency of MLLMs, and (ii) preservation of local context from visual features, which is vital for spatial understanding. Based on these findings, we propose a novel projector design that is both flexible and locality-enhanced, effectively satisfying both properties. Additionally, we present comprehensive strategies for effectively utilizing multiple and multifaceted instruction datasets. Through extensive experiments, we examine the impact of individual design choices. Finally, our proposed MLLM, Honeybee, remarkably outperforms previous state-of-the-art methods across various benchmarks, including MME, MMBench, SEED-Bench, and LLaVA-Bench, while achieving significantly higher efficiency. Code and models are available at https://github.com/kakaobrain/honeybee.
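To make the two projector properties concrete, the following is a minimal illustrative sketch and not the Honeybee projector itself, whose actual design is in the repository linked above. It assumes the visual features form an H x W grid of patch vectors and uses plain adaptive average pooling as a stand-in: each output token is the mean of one contiguous spatial window (local context is preserved), and the output grid size can be chosen freely (the number of visual tokens is flexible). The function name `adaptive_avg_pool_tokens` and the toy grid are hypothetical.

```python
import math


def adaptive_avg_pool_tokens(grid, out_h, out_w):
    """Average-pool an H x W grid of feature vectors down to out_h x out_w tokens.

    Illustrative stand-in only (an assumption, not the paper's projector):
    each output token averages one contiguous spatial window, and
    out_h * out_w sets the number of visual tokens.
    """
    in_h, in_w = len(grid), len(grid[0])
    dim = len(grid[0][0])
    tokens = []
    for i in range(out_h):
        # Row range covered by output row i (windows may overlap by one
        # row when in_h is not divisible by out_h).
        r0 = (i * in_h) // out_h
        r1 = math.ceil((i + 1) * in_h / out_h)
        for j in range(out_w):
            c0 = (j * in_w) // out_w
            c1 = math.ceil((j + 1) * in_w / out_w)
            window = [grid[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            # Mean over the window, one feature channel at a time.
            tokens.append(
                [sum(v[k] for v in window) / len(window) for k in range(dim)]
            )
    return tokens


# Toy 4 x 4 grid of 2-d "patch features": feature = [row index, column index].
grid = [[[float(r), float(c)] for c in range(4)] for r in range(4)]

tokens_4 = adaptive_avg_pool_tokens(grid, 2, 2)  # 16 patches -> 4 tokens
tokens_9 = adaptive_avg_pool_tokens(grid, 3, 3)  # 16 patches -> 9 tokens
tokens_1 = adaptive_avg_pool_tokens(grid, 1, 1)  # 16 patches -> 1 token
```

Because the toy features encode grid position, each pooled token's value is the centre of the window it summarises, which makes the locality visible: the four tokens in `tokens_4` correspond to the four quadrants of the grid. Changing `out_h` and `out_w` trades the number of tokens passed to the LLM against spatial detail.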
