LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images

March 18, 2024
作者: Ruyi Xu, Yuan Yao, Zonghao Guo, Junbo Cui, Zanlin Ni, Chunjiang Ge, Tat-Seng Chua, Zhiyuan Liu, Maosong Sun, Gao Huang
cs.AI

Abstract

Visual encoding constitutes the basis of large multimodal models (LMMs) in understanding the visual world. Conventional LMMs process images in fixed sizes and limited resolutions, while recent explorations in this direction are limited in adaptivity, efficiency, and even correctness. In this work, we first take GPT-4V and LLaVA-1.5 as representative examples and expose systematic flaws rooted in their visual encoding strategy. To address the challenges, we present LLaVA-UHD, a large multimodal model that can efficiently perceive images in any aspect ratio and high resolution. LLaVA-UHD includes three key components: (1) an image modularization strategy that divides native-resolution images into smaller variable-sized slices for efficient and extensible encoding, (2) a compression module that further condenses image tokens from visual encoders, and (3) a spatial schema to organize slice tokens for LLMs. Comprehensive experiments show that LLaVA-UHD outperforms established LMMs trained with 2-3 orders of magnitude more data on 9 benchmarks. Notably, our model built on LLaVA-1.5 336x336 supports images at 6 times larger resolution (i.e., 672x1088) using only 94% of the inference computation, and achieves a 6.4-point accuracy improvement on TextVQA. Moreover, the model can be efficiently trained in academic settings, within 23 hours on 8 A100 GPUs (vs. 26 hours for LLaVA-1.5). We make the data and code publicly available at https://github.com/thunlp/LLaVA-UHD.
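The abstract names three components but gives no implementation details. The sketch below illustrates only the first, the image-modularization step: partition a native-resolution image into a variable-sized grid of slices so that each slice stays close to the visual encoder's pretrained 336x336 input in both area and aspect ratio. The function names (`choose_grid`, `slice_image`), the `MAX_SLICES` bound, and the log-based scoring are illustrative assumptions, not LLaVA-UHD's exact algorithm.

```python
# Minimal sketch (assumed, not the authors' code) of grid selection and slicing
# for an arbitrary-resolution image, targeting a ViT pretrained at 336x336.
import math
from PIL import Image

VIT_RES = 336          # pretrained ViT input size (LLaVA-1.5 uses 336x336)
MAX_SLICES = 6         # assumed upper bound on slices per image

def choose_grid(width, height, max_slices=MAX_SLICES):
    """Pick (rows, cols) so each slice's area and aspect ratio deviate least
    from the encoder's native square input."""
    target_area = VIT_RES * VIT_RES
    best, best_score = (1, 1), float("inf")
    for n in range(1, max_slices + 1):
        for rows in range(1, n + 1):
            if n % rows:
                continue
            cols = n // rows
            slice_w, slice_h = width / cols, height / rows
            # penalize area mismatch and aspect-ratio distortion (log scale)
            area_err = abs(math.log(slice_w * slice_h / target_area))
            ratio_err = abs(math.log(slice_w / slice_h))
            score = area_err + ratio_err
            if score < best_score:
                best, best_score = (rows, cols), score
    return best

def slice_image(img: Image.Image):
    """Cut the image along the chosen grid and resize each slice to 336x336
    before feeding it to the visual encoder."""
    rows, cols = choose_grid(*img.size)
    w, h = img.size
    slices = []
    for r in range(rows):
        for c in range(cols):
            box = (c * w // cols, r * h // rows,
                   (c + 1) * w // cols, (r + 1) * h // rows)
            slices.append(img.crop(box).resize((VIT_RES, VIT_RES)))
    return slices, (rows, cols)
```

With this scoring, a 672x1088 image would be cut into a 3x2 grid of roughly 336x363 slices, consistent with the abstract's "6 times larger" example. The compression module and spatial schema would then act on the per-slice tokens (condensing each slice's visual tokens and marking grid structure for the LLM), but those steps are not shown here.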
