Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
August 28, 2024
Authors: Min Shi, Fuxiao Liu, Shihao Wang, Shijia Liao, Subhashree Radhakrishnan, De-An Huang, Hongxu Yin, Karan Sapra, Yaser Yacoob, Humphrey Shi, Bryan Catanzaro, Andrew Tao, Jan Kautz, Zhiding Yu, Guilin Liu
cs.AI
Abstract
The ability to accurately interpret complex visual information is a crucial
topic of multimodal large language models (MLLMs). Recent work indicates that
enhanced visual perception significantly reduces hallucinations and improves
performance on resolution-sensitive tasks, such as optical character
recognition and document analysis. A number of recent MLLMs achieve this goal
using a mixture of vision encoders. Despite their success, there is a lack of
systematic comparisons and detailed ablation studies addressing critical
aspects, such as expert selection and the integration of multiple vision
experts. This study provides an extensive exploration of the design space for
MLLMs using a mixture of vision encoders and resolutions. Our findings reveal
several underlying principles common to various existing strategies, leading to
a streamlined yet effective design approach. We discover that simply
concatenating visual tokens from a set of complementary vision encoders is as
effective as more complex mixing architectures or strategies. We additionally
introduce Pre-Alignment to bridge the gap between vision-focused encoders and
language tokens, enhancing model coherence. The resulting family of MLLMs,
Eagle, surpasses other leading open-source models on major MLLM benchmarks.
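The core finding, that simply concatenating visual tokens from complementary encoders matches more complex fusion schemes, can be sketched as follows. This is a hypothetical illustration, not the authors' code: it assumes two encoders (e.g., a CLIP-style and a ConvNeXt-style backbone) that produce per-token feature vectors over the same spatial token grid, and fuses them channel-wise before they would be projected into the LLM's embedding space.

```python
# Hypothetical sketch: fuse tokens from two vision encoders by
# channel-wise concatenation, assuming both share the same token grid.

def concat_visual_tokens(tokens_a, tokens_b):
    """tokens_a, tokens_b: lists of per-token feature vectors (equal length).

    Returns one token list whose vectors are the channel-wise
    concatenation of the two encoders' features.
    """
    assert len(tokens_a) == len(tokens_b), "encoders must share a token grid"
    return [fa + fb for fa, fb in zip(tokens_a, tokens_b)]

# Toy example: 2 tokens; encoder A emits 3 channels, encoder B emits 2.
clip_like_tokens = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
convnext_like_tokens = [[1.0, 1.1], [1.2, 1.3]]

fused = concat_visual_tokens(clip_like_tokens, convnext_like_tokens)
# Each fused token now carries 3 + 2 = 5 channels; a single linear
# projection would then map these into the language token space.
```

The appeal of this design is that the token count stays fixed while the channel dimension grows, so the downstream projector and LLM are unchanged apart from one input width.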
Models and code: https://github.com/NVlabs/Eagle