Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
August 28, 2024
Authors: Min Shi, Fuxiao Liu, Shihao Wang, Shijia Liao, Subhashree Radhakrishnan, De-An Huang, Hongxu Yin, Karan Sapra, Yaser Yacoob, Humphrey Shi, Bryan Catanzaro, Andrew Tao, Jan Kautz, Zhiding Yu, Guilin Liu
cs.AI
Abstract
The ability to accurately interpret complex visual information is a crucial
topic of multimodal large language models (MLLMs). Recent work indicates that
enhanced visual perception significantly reduces hallucinations and improves
performance on resolution-sensitive tasks, such as optical character
recognition and document analysis. A number of recent MLLMs achieve this goal
using a mixture of vision encoders. Despite their success, there is a lack of
systematic comparisons and detailed ablation studies addressing critical
aspects, such as expert selection and the integration of multiple vision
experts. This study provides an extensive exploration of the design space for
MLLMs using a mixture of vision encoders and resolutions. Our findings reveal
several underlying principles common to various existing strategies, leading to
a streamlined yet effective design approach. We discover that simply
concatenating visual tokens from a set of complementary vision encoders is as
effective as more complex mixing architectures or strategies. We additionally
introduce Pre-Alignment to bridge the gap between vision-focused encoders and
language tokens, enhancing model coherence. The resulting family of MLLMs,
Eagle, surpasses other leading open-source models on major MLLM benchmarks.
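The core finding, that simply concatenating visual tokens from complementary encoders matches more complex fusion schemes, can be sketched as follows. This is a hypothetical illustration, not the authors' code: it assumes two encoders (e.g., a CLIP-style and a ConvNeXt-style backbone) that produce per-token feature vectors over the same spatial token grid, and fuses them channel-wise before they would be projected into the LLM's embedding space.

```python
# Hypothetical sketch: fuse tokens from two vision encoders by
# channel-wise concatenation, assuming both share the same token grid.

def concat_visual_tokens(tokens_a, tokens_b):
    """tokens_a, tokens_b: lists of per-token feature vectors (equal length).

    Returns one token list whose vectors are the channel-wise
    concatenation of the two encoders' features.
    """
    assert len(tokens_a) == len(tokens_b), "encoders must share a token grid"
    return [fa + fb for fa, fb in zip(tokens_a, tokens_b)]

# Toy example: 2 tokens; encoder A emits 3 channels, encoder B emits 2.
clip_like_tokens = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
convnext_like_tokens = [[1.0, 1.1], [1.2, 1.3]]

fused = concat_visual_tokens(clip_like_tokens, convnext_like_tokens)
# Each fused token now carries 3 + 2 = 5 channels; a single linear
# projection would then map these into the language token space.
```

The appeal of this design is that the token count stays fixed while the channel dimension grows, so the downstream projector and LLM are unchanged apart from one input width.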
Models and code: https://github.com/NVlabs/Eagle