LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images

Abstract

Visual encoding is the basis on which large multimodal models (LMMs) understand the visual world. Conventional LMMs process images at fixed sizes and limited resolutions, and recent explorations in this direction are limited in adaptivity, efficiency, and even correctness. In this work, we first take GPT-4V and LLaVA-1.5 as representative examples and expose systematic flaws rooted in their visual encoding strategies. To address these challenges, we present LLaVA-UHD, a large multimodal model that can efficiently perceive images of any aspect ratio and at high resolution. LLaVA-UHD includes three key components: (1) an image modularization strategy that divides native-resolution images into smaller variable-sized slices for efficient and extensible encoding, (2) a compression module that further condenses image tokens from visual encoders, and (3) a spatial schema that organizes slice tokens for LLMs. Comprehensive experiments show that LLaVA-UHD outperforms established LMMs trained with 2-3 orders of magnitude more data on 9 benchmarks. Notably, our model built on LLaVA-1.5 336x336 supports images at 6 times the resolution (i.e., 672x1088) using only 94% of the inference computation, and improves accuracy on TextVQA by 6.4 points. Moreover, the model can be trained efficiently in academic settings, within 23 hours on 8 A100 GPUs (vs. 26 hours for LLaVA-1.5). We make the data and code publicly available at https://github.com/thunlp/LLaVA-UHD.
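To make the image modularization idea in component (1) more concrete, the following is a minimal Python sketch of how a native-resolution image might be split into variable-sized slices. It is an illustration under stated assumptions, not the authors' implementation: the function names `choose_grid` and `slice_boxes`, the `max_slices` cap, and the scoring rule (prefer slices whose aspect ratio is closest to square) are all hypothetical choices made for this example. Only the 336x336 encoder resolution and the 672x1088 input size come from the abstract.

```python
import math


def choose_grid(width, height, base=336, max_slices=6):
    """Pick a (cols, rows) grid for slicing a width x height image.

    Assumption for this sketch: the visual encoder prefers roughly
    base x base inputs, so we look for a grid whose slices are close
    to square and whose count is near (image area / base area).
    """
    ideal = math.ceil(width * height / (base * base))
    ideal = min(max_slices, max(1, ideal))

    # Consider slice counts near the ideal and every way to factor them.
    candidates = []
    for n in (ideal - 1, ideal, ideal + 1):
        if n < 1 or n > max_slices:
            continue
        for cols in range(1, n + 1):
            if n % cols == 0:
                candidates.append((cols, n // cols))

    def score(grid):
        cols, rows = grid
        slice_ratio = (width / cols) / (height / rows)
        return abs(math.log(slice_ratio))  # 0.0 means a square slice

    return min(candidates, key=score)


def slice_boxes(width, height, base=336, max_slices=6):
    """Return (left, top, right, bottom) pixel boxes in row-major order."""
    cols, rows = choose_grid(width, height, base, max_slices)
    boxes = []
    for r in range(rows):
        for c in range(cols):
            left = round(c * width / cols)
            top = round(r * height / rows)
            right = round((c + 1) * width / cols)
            bottom = round((r + 1) * height / rows)
            boxes.append((left, top, right, bottom))
    return boxes


if __name__ == "__main__":
    print(choose_grid(672, 1088))
    for box in slice_boxes(672, 1088):
        print(box)
```

Under these assumptions, a 672x1088 image has about 6.5 times the area of a 336x336 input, and the sketch selects a grid of 2 columns by 3 rows, so each slice is roughly 336x363 pixels. The boxes use the (left, top, right, bottom) convention, which is the form accepted by, for example, Pillow's `Image.crop`. The compression module and spatial schema from components (2) and (3) are not modeled here.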
