
ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models

May 24, 2024
作者: Chunjiang Ge, Sijie Cheng, Ziming Wang, Jiale Yuan, Yuan Gao, Jun Song, Shiji Song, Gao Huang, Bo Zheng
cs.AI

Abstract

High-resolution Large Multimodal Models (LMMs) encounter the challenges of excessive visual tokens and quadratic visual complexity. Current high-resolution LMMs address the quadratic complexity while still generating excessive visual tokens. However, the redundancy in visual tokens is the key problem as it leads to more substantial compute. To mitigate this issue, we propose ConvLLaVA, which employs ConvNeXt, a hierarchical backbone, as the visual encoder of LMM to replace Vision Transformer (ViT). ConvLLaVA compresses high-resolution images into information-rich visual features, effectively preventing the generation of excessive visual tokens. To enhance the capabilities of ConvLLaVA, we propose two critical optimizations. Since the low-resolution pretrained ConvNeXt underperforms when directly applied on high resolution, we update it to bridge the gap. Moreover, since ConvNeXt's original compression ratio is inadequate for much higher resolution inputs, we train a successive stage to further compress the visual tokens, thereby reducing redundancy. These optimizations enable ConvLLaVA to support inputs of 1536x1536 resolution generating only 576 visual tokens, capable of handling images of arbitrary aspect ratios. Experimental results demonstrate that our method achieves competitive performance with state-of-the-art models on mainstream benchmarks. The ConvLLaVA model series are publicly available at https://github.com/alibaba/conv-llava.
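As a rough illustration of the token arithmetic in the abstract, the sketch below computes visual-token counts under stated assumptions (it is not the authors' code). A standard ConvNeXt downsamples by 32x overall (a 4x stem followed by 2x at each of three stage transitions); the successive compression stage described above would add another 2x, for 64x total, which is how a 1536x1536 input can yield only 576 tokens.

```python
def num_visual_tokens(resolution: int, downsample: int) -> int:
    """Tokens produced for a square image at the given total downsampling factor."""
    side = resolution // downsample
    return side * side

# ViT-L/14 at 336x336 (the common LLaVA setting): 14x patch downsampling.
vit_tokens = num_visual_tokens(336, 14)        # (336/14)^2 = 576

# ConvNeXt at 1536x1536 with its original 32x compression ratio.
convnext_tokens = num_visual_tokens(1536, 32)  # (1536/32)^2 = 2304

# With the assumed extra 2x stage (64x total compression).
convllava_tokens = num_visual_tokens(1536, 64)  # (1536/64)^2 = 576

print(vit_tokens, convnext_tokens, convllava_tokens)  # 576 2304 576
```

Under these assumptions, the extra stage brings a 1536-resolution input down to the same 576-token budget that a 336-resolution ViT encoder produces, which matches the figures quoted in the abstract.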
