ConvLLaVA: 大規模マルチモーダルモデルのための視覚エンコーダとしての階層型バックボーン

要旨

高解像度の大規模マルチモーダルモデル（LMM）は、過剰な視覚トークンと二次的な視覚的複雑性という課題に直面しています。現在の高解像度LMMは、二次的な複雑性に対処しながらも、依然として過剰な視覚トークンを生成します。しかし、視覚トークンの冗長性が主要な問題であり、これがより大きな計算負荷を引き起こします。この問題を緩和するため、我々はConvLLaVAを提案します。ConvLLaVAは、Vision Transformer（ViT）の代わりに、階層型バックボーンであるConvNeXtをLMMの視覚エンコーダとして採用します。ConvLLaVAは、高解像度画像を情報豊富な視覚特徴に圧縮し、過剰な視覚トークンの生成を効果的に防ぎます。ConvLLaVAの能力を向上させるため、我々は2つの重要な最適化を提案します。低解像度で事前学習されたConvNeXtは、高解像度に直接適用すると性能が低下するため、このギャップを埋めるために更新します。さらに、ConvNeXtの元の圧縮率は、より高解像度の入力に対して不十分であるため、視覚トークンをさらに圧縮するための連続ステージを訓練し、冗長性を削減します。これらの最適化により、ConvLLaVAは1536x1536解像度の入力に対して576個の視覚トークンしか生成せず、任意のアスペクト比の画像を処理可能です。実験結果は、我々の手法が主流のベンチマークにおいて最先端のモデルと競争力のある性能を達成することを示しています。ConvLLaVAモデルシリーズは、https://github.com/alibaba/conv-llava で公開されています。

English

High-resolution Large Multimodal Models (LMMs) encounter the challenges of excessive visual tokens and quadratic visual complexity. Current high-resolution LMMs address the quadratic complexity while still generating excessive visual tokens. However, the redundancy in visual tokens is the key problem as it leads to more substantial compute. To mitigate this issue, we propose ConvLLaVA, which employs ConvNeXt, a hierarchical backbone, as the visual encoder of LMM to replace Vision Transformer (ViT). ConvLLaVA compresses high-resolution images into information-rich visual features, effectively preventing the generation of excessive visual tokens. To enhance the capabilities of ConvLLaVA, we propose two critical optimizations. Since the low-resolution pretrained ConvNeXt underperforms when directly applied on high resolution, we update it to bridge the gap. Moreover, since ConvNeXt's original compression ratio is inadequate for much higher resolution inputs, we train a successive stage to further compress the visual tokens, thereby reducing redundancy. These optimizations enable ConvLLaVA to support inputs of 1536x1536 resolution generating only 576 visual tokens, capable of handling images of arbitrary aspect ratios. Experimental results demonstrate that our method achieves competitive performance with state-of-the-art models on mainstream benchmarks. The ConvLLaVA model series are publicly available at https://github.com/alibaba/conv-llava.

ConvLLaVA: 大規模マルチモーダルモデルのための視覚エンコーダとしての階層型バックボーン

ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models

要旨

Support