Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders

Abstract

The ability to accurately interpret complex visual information is a crucial topic for multimodal large language models (MLLMs). Recent work indicates that enhanced visual perception significantly reduces hallucinations and improves performance on resolution-sensitive tasks, such as optical character recognition and document analysis. A number of recent MLLMs achieve this goal using a mixture of vision encoders. Despite their success, systematic comparisons and detailed ablation studies are lacking for critical aspects such as expert selection and the integration of multiple vision experts. This study systematically explores the design space for MLLMs using a mixture of vision encoders and resolutions. Our findings reveal several underlying principles common to various existing strategies, leading to a streamlined yet effective design approach. We find that simply concatenating visual tokens from a set of complementary vision encoders is as effective as more complex mixing architectures or strategies. We additionally introduce Pre-Alignment to bridge the gap between vision-focused encoders and language tokens, enhancing model coherence. The resulting family of MLLMs, Eagle, surpasses other leading open-source models on major MLLM benchmarks. Models and code: https://github.com/NVlabs/Eagle
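The token concatenation mentioned in the abstract can be illustrated with a minimal sketch. This is not the Eagle implementation. It assumes that the encoders' outputs have already been resampled to the same number of tokens and that fusion happens along the feature (channel) axis; the abstract does not state either detail. The function name `concat_visual_tokens` and the toy values are hypothetical.

```python
from typing import List, Sequence

Token = List[float]       # one visual token = one feature vector
TokenGrid = List[Token]   # tokens from one encoder, in a fixed spatial order


def concat_visual_tokens(encoder_outputs: Sequence[TokenGrid]) -> TokenGrid:
    """Fuse per-encoder tokens by concatenating features at each position.

    Assumption (not stated in the abstract): every encoder output has the
    same number of tokens, so position i refers to the same image region
    for every encoder.
    """
    if not encoder_outputs:
        raise ValueError("need at least one encoder output")
    num_tokens = len(encoder_outputs[0])
    if any(len(out) != num_tokens for out in encoder_outputs):
        raise ValueError("all encoders must yield the same number of tokens")

    fused: TokenGrid = []
    # zip(*...) groups the i-th token of every encoder together.
    for tokens_at_position in zip(*encoder_outputs):
        merged: Token = []
        for token in tokens_at_position:
            merged.extend(token)  # append this encoder's channels
        fused.append(merged)
    return fused


# Hypothetical toy outputs: 2 tokens each, with 2 and 3 channels.
encoder_a = [[0.1, 0.2], [0.3, 0.4]]
encoder_b = [[1.0, 1.1, 1.2], [2.0, 2.1, 2.2]]

fused = concat_visual_tokens([encoder_a, encoder_b])
print(len(fused), len(fused[0]))  # 2 tokens, 5 channels each
```

Under these assumptions the token count stays fixed while the per-token feature width grows with each added encoder, so a single projection layer could map the fused tokens into the language model's embedding space.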

