Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders

Abstract

The ability to accurately interpret complex visual information is a crucial topic of multimodal large language models (MLLMs). Recent work indicates that enhanced visual perception significantly reduces hallucinations and improves performance on resolution-sensitive tasks, such as optical character recognition and document analysis. A number of recent MLLMs achieve this goal using a mixture of vision encoders. Despite their success, there is a lack of systematic comparisons and detailed ablation studies addressing critical aspects, such as expert selection and the integration of multiple vision experts. This study provides an extensive exploration of the design space for MLLMs using a mixture of vision encoders and resolutions. Our findings reveal several underlying principles common to various existing strategies, leading to a streamlined yet effective design approach. We discover that simply concatenating visual tokens from a set of complementary vision encoders is as effective as more complex mixing architectures or strategies. We additionally introduce Pre-Alignment to bridge the gap between vision-focused encoders and language tokens, enhancing model coherence. The resulting family of MLLMs, Eagle, surpasses other leading open-source models on major MLLM benchmarks. Models and code: https://github.com/NVlabs/Eagle
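The concatenation finding above can be illustrated with a minimal sketch. The abstract does not say along which axis the visual tokens are concatenated, so this example assumes channel-wise concatenation of token sequences of equal length; the function name `concat_channels` and the toy encoder outputs are hypothetical and are not taken from the Eagle codebase.

```python
from typing import List

# One row per visual token, one column per feature channel.
Tokens = List[List[float]]


def concat_channels(per_encoder_tokens: List[Tokens]) -> Tokens:
    """Fuse token sequences from several encoders along the channel axis.

    Assumption (not stated in the abstract): every encoder yields the same
    number of tokens, e.g. after its feature map is resized to a shared grid.
    """
    if not per_encoder_tokens:
        raise ValueError("need at least one encoder output")
    num_tokens = len(per_encoder_tokens[0])
    for tokens in per_encoder_tokens:
        if len(tokens) != num_tokens:
            raise ValueError("all encoders must yield the same number of tokens")

    fused: Tokens = []
    for i in range(num_tokens):
        row: List[float] = []
        # Append the i-th token of each encoder, in encoder order.
        for tokens in per_encoder_tokens:
            row.extend(tokens[i])
        fused.append(row)
    return fused


# Toy outputs from two hypothetical encoders: 3 tokens each, 2 and 4 channels.
encoder_a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
encoder_b = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8], [0.9, 1.0, 1.1, 1.2]]

fused = concat_channels([encoder_a, encoder_b])
print(len(fused), len(fused[0]))  # 3 tokens, 2 + 4 = 6 channels each
```

Under this assumption the token count seen by the language model stays fixed while the feature width grows with each added encoder. A projection layer mapping the fused width to the language model's embedding size would normally follow; it is omitted here.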
