ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models
May 24, 2024
作者: Chunjiang Ge, Sijie Cheng, Ziming Wang, Jiale Yuan, Yuan Gao, Jun Song, Shiji Song, Gao Huang, Bo Zheng
cs.AI
Abstract
High-resolution Large Multimodal Models (LMMs) encounter the challenges of
excessive visual tokens and quadratic visual complexity. Current
high-resolution LMMs address the quadratic complexity while still generating
excessive visual tokens. However, the redundancy in visual tokens is the key
problem as it leads to more substantial compute. To mitigate this issue, we
propose ConvLLaVA, which employs ConvNeXt, a hierarchical backbone, as the
visual encoder of LMM to replace Vision Transformer (ViT). ConvLLaVA compresses
high-resolution images into information-rich visual features, effectively
preventing the generation of excessive visual tokens. To enhance the
capabilities of ConvLLaVA, we propose two critical optimizations. Since the
low-resolution pretrained ConvNeXt underperforms when directly applied to high
resolutions, we update it to bridge the gap. Moreover, since ConvNeXt's original
compression ratio is inadequate for much higher resolution inputs, we train a
successive stage to further compress the visual tokens, thereby reducing
redundancy. These optimizations enable ConvLLaVA to support 1536x1536-resolution
inputs while generating only 576 visual tokens, and to handle images of
arbitrary aspect ratios. Experimental results demonstrate that our method
achieves competitive performance with state-of-the-art models on mainstream
benchmarks. The ConvLLaVA model series are publicly available at
https://github.com/alibaba/conv-llava.
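To make the token-count claim concrete, the sketch below works through the arithmetic implied by the abstract, assuming ConvNeXt's standard 32x overall downsampling (stage strides of 4, 2, 2, 2) and an additional stride-2 compression stage. The function names and the stride of the extra stage are illustrative assumptions for this sketch, not the authors' code.

```python
# Illustrative arithmetic only (not the authors' implementation): how a
# hierarchical backbone with one extra downsampling stage reaches 576 tokens
# at 1536x1536, versus a plain ViT patch embedding.

def vit_tokens(resolution: int, patch_size: int = 14) -> int:
    """Visual tokens for a ViT that splits the image into patch_size x patch_size patches."""
    return (resolution // patch_size) ** 2

def convnext_tokens(resolution: int, total_stride: int = 32) -> int:
    """ConvNeXt's four stages downsample by 4 * 2 * 2 * 2 = 32x overall."""
    return (resolution // total_stride) ** 2

def compressed_tokens(resolution: int, extra_stride: int = 2) -> int:
    """One successive stage (assumed stride 2) on top of ConvNeXt's 32x downsampling."""
    return (resolution // (32 * extra_stride)) ** 2

print(vit_tokens(336))          # 576 tokens for a 336px ViT-L/14 (the common LLaVA setup)
print(convnext_tokens(1536))    # 2304 tokens at 1536px without the extra stage
print(compressed_tokens(1536))  # 576 tokens at 1536px, matching the abstract
```

The numbers line up with the abstract: the successive stage doubles the effective stride from 32x to 64x, so a 1536px input yields the same 576 visual tokens that a standard 336px ViT-L/14 produces.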