LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images

Abstract

Visual encoding constitutes the basis of large multimodal models (LMMs) in understanding the visual world. Conventional LMMs process images in fixed sizes and limited resolutions, while recent explorations in this direction are limited in adaptivity, efficiency, and even correctness. In this work, we first take GPT-4V and LLaVA-1.5 as representative examples and expose systematic flaws rooted in their visual encoding strategy. To address these challenges, we present LLaVA-UHD, a large multimodal model that can efficiently perceive images in any aspect ratio and high resolution. LLaVA-UHD includes three key components: (1) an image modularization strategy that divides native-resolution images into smaller variable-sized slices for efficient and extensible encoding, (2) a compression module that further condenses image tokens from visual encoders, and (3) a spatial schema to organize slice tokens for LLMs. Comprehensive experiments show that LLaVA-UHD outperforms established LMMs trained with 2-3 orders of magnitude more data on 9 benchmarks. Notably, our model built on LLaVA-1.5 336x336 supports images at 6 times larger resolution (i.e., 672x1088) using only 94% of the inference computation, and improves accuracy on TextVQA by 6.4 points. Moreover, the model can be efficiently trained in academic settings, within 23 hours on 8 A100 GPUs (vs. 26 hours for LLaVA-1.5). We make the data and code publicly available at https://github.com/thunlp/LLaVA-UHD.
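
The resolution figures above can be checked with a little arithmetic, and the idea of dividing an image into variable-sized slices can be illustrated with a small sketch. The code below is only an illustration under stated assumptions: the function names (`pixel_ratio`, `choose_grid`, `slice_size`) and the grid-selection heuristic are hypothetical and are not taken from the abstract or from the released code. Only the 336x336 base size and the 672x1088 example come from the text.

```python
import math

BASE = 336  # encoder input side, from "LLaVA-1.5 336x336" in the abstract


def pixel_ratio(width, height, base=BASE):
    """How many times more pixels an image has than one base x base input."""
    return (width * height) / (base * base)


def choose_grid(width, height, base=BASE):
    """Pick a (cols, rows) slice grid for an image.

    Hypothetical heuristic (an assumption for illustration): take the pixel
    ratio rounded up as the nominal slice count n, consider every
    factorization of n - 1, n, and n + 1, and keep the grid whose
    cols/rows ratio is closest to the image aspect ratio in log space.
    """
    n = max(1, math.ceil(pixel_ratio(width, height, base)))
    counts = [k for k in (n - 1, n, n + 1) if k >= 1]
    grids = [(c, k // c) for k in counts for c in range(1, k + 1) if k % c == 0]
    target = math.log(width / height)
    return min(grids, key=lambda g: abs(math.log(g[0] / g[1]) - target))


def slice_size(width, height, grid):
    """Width and height of each slice when the image is cut into the grid."""
    cols, rows = grid
    return math.ceil(width / cols), math.ceil(height / rows)


if __name__ == "__main__":
    w, h = 672, 1088
    print(round(pixel_ratio(w, h), 2))   # 6.48, i.e. roughly "6 times larger"
    grid = choose_grid(w, h)
    print(grid)                          # (2, 3): 2 columns by 3 rows
    print(slice_size(w, h, grid))        # (336, 363)
```

Under this heuristic the 672x1088 example is cut into six slices of about 336x363 pixels, each close to the base input size but not identical to it, which is one way to read "variable-sized slices". The actual partitioning rule used by the model may differ.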
