Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
August 28, 2024
Authors: Min Shi, Fuxiao Liu, Shihao Wang, Shijia Liao, Subhashree Radhakrishnan, De-An Huang, Hongxu Yin, Karan Sapra, Yaser Yacoob, Humphrey Shi, Bryan Catanzaro, Andrew Tao, Jan Kautz, Zhiding Yu, Guilin Liu
cs.AI
Abstract
The ability to accurately interpret complex visual information is a crucial
topic of multimodal large language models (MLLMs). Recent work indicates that
enhanced visual perception significantly reduces hallucinations and improves
performance on resolution-sensitive tasks, such as optical character
recognition and document analysis. A number of recent MLLMs achieve this goal
using a mixture of vision encoders. Despite their success, there is a lack of
systematic comparisons and detailed ablation studies addressing critical
aspects, such as expert selection and the integration of multiple vision
experts. This study provides an extensive exploration of the design space for
MLLMs using a mixture of vision encoders and resolutions. Our findings reveal
several underlying principles common to various existing strategies, leading to
a streamlined yet effective design approach. We discover that simply
concatenating visual tokens from a set of complementary vision encoders is as
effective as more complex mixing architectures or strategies. We additionally
introduce Pre-Alignment to bridge the gap between vision-focused encoders and
language tokens, enhancing model coherence. The resulting family of MLLMs,
Eagle, surpasses other leading open-source models on major MLLM benchmarks.
Models and code: https://github.com/NVlabs/Eagle
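The abstract's central finding is that simply concatenating visual tokens from complementary vision encoders along the channel dimension, followed by a single projection into the language model's embedding space, matches more complex mixing architectures. The sketch below illustrates that fusion step only; it is not the authors' code, and all names and dimensions (`NUM_TOKENS`, `DIM_A`, `DIM_B`, `LLM_DIM`) are illustrative assumptions.

```python
import numpy as np

# Illustrative sketch of channel-wise token concatenation across two
# vision encoders. Real encoders would produce these token grids from an
# image; here we use random features of plausible shapes.
NUM_TOKENS = 16            # tokens per image, after aligning spatial resolutions
DIM_A, DIM_B = 1024, 768   # per-encoder feature dims (e.g. a CLIP-style and a
                           # detection/OCR-oriented encoder)
LLM_DIM = 4096             # language-model hidden size (assumed)

rng = np.random.default_rng(0)
tokens_a = rng.standard_normal((NUM_TOKENS, DIM_A))  # encoder A output
tokens_b = rng.standard_normal((NUM_TOKENS, DIM_B))  # encoder B output

# The "mixture" is plain concatenation along the channel (feature) axis:
fused = np.concatenate([tokens_a, tokens_b], axis=-1)   # (NUM_TOKENS, DIM_A + DIM_B)

# One linear projection maps fused tokens into the LLM embedding space,
# where they are consumed as a visual prefix alongside text tokens.
proj = rng.standard_normal((DIM_A + DIM_B, LLM_DIM)) * 0.01
visual_tokens = fused @ proj                            # (NUM_TOKENS, LLM_DIM)

print(fused.shape, visual_tokens.shape)
```

The key design point, per the abstract, is that no gating, routing, or cross-attention mixer is needed between encoders; the concatenation plus projection suffices when the encoders are complementary.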