Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference

Abstract

In recent years, multimodal large language models (MLLMs) have achieved remarkable success across a wide range of fields. However, as foundation models for many downstream tasks, current MLLMs are built on the well-known Transformer network, whose computational complexity is quadratic in sequence length and therefore inefficient. To improve the efficiency of such foundation models, we propose Cobra, an MLLM with linear computational complexity. Specifically, Cobra integrates the efficient Mamba language model into the visual modality. Moreover, we explore and study various modal fusion schemes to build an effective multimodal Mamba. Extensive experiments demonstrate that (1) Cobra achieves extremely competitive performance against current computationally efficient state-of-the-art methods, e.g., LLaVA-Phi, TinyLLaVA, and MobileVLM v2, while running faster thanks to its linear sequential modeling; (2) interestingly, results on challenging closed-set prediction benchmarks show that Cobra performs well at overcoming visual illusions and judging spatial relationships; and (3) notably, Cobra achieves performance comparable to LLaVA with about 43% of the number of parameters. We will make all code for Cobra open source and hope that the proposed method can facilitate future research on complexity problems in MLLMs. Our project page is available at: https://sites.google.com/view/cobravlm.
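
To make the quadratic-versus-linear contrast in the abstract concrete, the sketch below compares how the work grows with sequence length in the two cases. It is only an illustration under stated assumptions: `attention_pair_count` and `linear_scan` are hypothetical helper names, the recurrence is a toy scalar state update with fixed coefficients, and none of this is Cobra's or Mamba's actual implementation (Mamba uses input-dependent parameters and a hardware-aware scan).

```python
def attention_pair_count(seq_len: int) -> int:
    """Query-key score computations in full self-attention: one per token pair."""
    return seq_len * seq_len


def linear_scan(xs, a=0.5, b=1.0, c=1.0):
    """Toy scalar recurrence h_t = a * h_{t-1} + b * x_t, y_t = c * h_t.

    Assumption: fixed scalar coefficients, chosen only for illustration.
    Returns the outputs and the number of state updates performed.
    """
    h = 0.0
    ys = []
    steps = 0
    for x in xs:
        h = a * h + b * x   # one constant-cost state update per token
        ys.append(c * h)
        steps += 1
    return ys, steps


if __name__ == "__main__":
    # Doubling the sequence length quadruples the attention pair count
    # but only doubles the number of recurrent state updates.
    for n in (256, 512, 1024):
        _, steps = linear_scan([1.0] * n)
        print(n, attention_pair_count(n), steps)
```

The recurrent form touches each token once and carries a fixed-size state, which is the property that gives linear-time sequence modeling; full self-attention instead scores every token against every other token.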
