ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models
May 24, 2024
作者: Chunjiang Ge, Sijie Cheng, Ziming Wang, Jiale Yuan, Yuan Gao, Jun Song, Shiji Song, Gao Huang, Bo Zheng
cs.AI
Abstract
High-resolution Large Multimodal Models (LMMs) encounter the challenges of
excessive visual tokens and quadratic visual complexity. Current
high-resolution LMMs address the quadratic complexity while still generating
excessive visual tokens. However, the redundancy in visual tokens is the key
problem as it leads to more substantial compute. To mitigate this issue, we
propose ConvLLaVA, which employs ConvNeXt, a hierarchical backbone, as the
visual encoder of LMM to replace Vision Transformer (ViT). ConvLLaVA compresses
high-resolution images into information-rich visual features, effectively
preventing the generation of excessive visual tokens. To enhance the
capabilities of ConvLLaVA, we propose two critical optimizations. Since the
low-resolution pretrained ConvNeXt underperforms when directly applied to high
resolutions, we update it to bridge the gap. Moreover, since ConvNeXt's original
compression ratio is inadequate for much higher resolution inputs, we train a
successive stage to further compress the visual tokens, thereby reducing
redundancy. These optimizations enable ConvLLaVA to support 1536x1536-resolution
inputs while generating only 576 visual tokens, and to handle images of
arbitrary aspect ratios. Experimental results demonstrate that our method
achieves competitive performance with state-of-the-art models on mainstream
benchmarks. The ConvLLaVA model series are publicly available at
https://github.com/alibaba/conv-llava.
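To make the token-count claim concrete, the sketch below works through the arithmetic implied by the abstract, assuming ConvNeXt's standard 32x overall downsampling (stage strides of 4, 2, 2, 2) and an additional stride-2 compression stage. The function names and the stride of the extra stage are illustrative assumptions for this sketch, not the authors' code.

```python
# Illustrative arithmetic only (not the authors' implementation): how a
# hierarchical backbone with one extra downsampling stage reaches 576 tokens
# at 1536x1536, versus a plain ViT patch embedding.

def vit_tokens(resolution: int, patch_size: int = 14) -> int:
    """Visual tokens for a ViT that splits the image into patch_size x patch_size patches."""
    return (resolution // patch_size) ** 2

def convnext_tokens(resolution: int, total_stride: int = 32) -> int:
    """ConvNeXt's four stages downsample by 4 * 2 * 2 * 2 = 32x overall."""
    return (resolution // total_stride) ** 2

def compressed_tokens(resolution: int, extra_stride: int = 2) -> int:
    """One successive stage (assumed stride 2) on top of ConvNeXt's 32x downsampling."""
    return (resolution // (32 * extra_stride)) ** 2

print(vit_tokens(336))          # 576 tokens for a 336px ViT-L/14 (the common LLaVA setup)
print(convnext_tokens(1536))    # 2304 tokens at 1536px without the extra stage
print(compressed_tokens(1536))  # 576 tokens at 1536px, matching the abstract
```

The numbers line up with the abstract: the successive stage doubles the effective stride from 32x to 64x, so a 1536px input yields the same 576 visual tokens that a standard 336px ViT-L/14 produces.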