ChatPaper.ai


LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images

March 18, 2024
作者: Ruyi Xu, Yuan Yao, Zonghao Guo, Junbo Cui, Zanlin Ni, Chunjiang Ge, Tat-Seng Chua, Zhiyuan Liu, Maosong Sun, Gao Huang
cs.AI

Abstract

Visual encoding constitutes the basis of large multimodal models (LMMs) in understanding the visual world. Conventional LMMs process images in fixed sizes and limited resolutions, while recent explorations in this direction are limited in adaptivity, efficiency, and even correctness. In this work, we first take GPT-4V and LLaVA-1.5 as representative examples and expose systematic flaws rooted in their visual encoding strategy. To address the challenges, we present LLaVA-UHD, a large multimodal model that can efficiently perceive images in any aspect ratio and high resolution. LLaVA-UHD includes three key components: (1) An image modularization strategy that divides native-resolution images into smaller variable-sized slices for efficient and extensible encoding, (2) a compression module that further condenses image tokens from visual encoders, and (3) a spatial schema to organize slice tokens for LLMs. Comprehensive experiments show that LLaVA-UHD outperforms established LMMs trained with 2-3 orders of magnitude more data on 9 benchmarks. Notably, our model built on LLaVA-1.5 336x336 supports 6 times larger (i.e., 672x1088) resolution images using only 94% inference computation, and achieves 6.4 accuracy improvement on TextVQA. Moreover, the model can be efficiently trained in academic settings, within 23 hours on 8 A100 GPUs (vs. 26 hours of LLaVA-1.5). We make the data and code publicly available at https://github.com/thunlp/LLaVA-UHD.
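To make the modularization idea above concrete, the following is a minimal conceptual sketch, not the authors' released code: the function names (choose_grid, slice_image), the base_size=336 slice budget, and the log-ratio grid-selection heuristic are illustrative assumptions about how a native-resolution image could be divided into variable-sized slices whose grid roughly follows the image's aspect ratio.

```python
# Conceptual sketch of native-resolution image modularization in the spirit
# of LLaVA-UHD's slicing step. NOT the authors' implementation; the grid
# heuristic and all names here are illustrative assumptions.
import math
from PIL import Image


def choose_grid(width: int, height: int, base_size: int = 336) -> tuple[int, int]:
    """Pick a (cols, rows) slice grid that covers the image with roughly
    base_size x base_size slices while staying close to its aspect ratio."""
    # Rough number of base_size x base_size slices needed to cover the image.
    n_slices = max(1, math.ceil((width * height) / (base_size * base_size)))
    best, best_err = (1, 1), float("inf")
    for cols in range(1, n_slices + 1):
        rows = math.ceil(n_slices / cols)
        # Compare grid aspect ratio to the image aspect ratio on a log scale
        # so that 2:1 and 1:2 mismatches are penalized symmetrically.
        err = abs(math.log((cols / rows) / (width / height)))
        if err < best_err:
            best, best_err = (cols, rows), err
    return best


def slice_image(img: Image.Image, base_size: int = 336) -> list[Image.Image]:
    """Split an image into variable-sized slices on the chosen grid; each
    slice would then be resized and encoded independently by the ViT."""
    cols, rows = choose_grid(img.width, img.height, base_size)
    slice_w, slice_h = img.width / cols, img.height / rows
    slices = []
    for r in range(rows):
        for c in range(cols):
            box = (round(c * slice_w), round(r * slice_h),
                   round((c + 1) * slice_w), round((r + 1) * slice_h))
            slices.append(img.crop(box))
    return slices


if __name__ == "__main__":
    img = Image.new("RGB", (672, 1088))        # e.g., a tall 672x1088 input
    print(choose_grid(img.width, img.height))  # grid adapted to the aspect ratio
    print(len(slice_image(img)))               # number of variable-sized slices
```

The log-ratio criterion is one simple way to keep slices close to the encoder's native shape so that tall or wide images are not distorted by a single fixed-size resize; the paper's actual slice scoring, token compression module, and spatial schema are described in the full text.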
