EVLM: An Efficient Vision-Language Model for Visual Understanding

Abstract

In the field of multi-modal language models, most methods are built on an architecture similar to LLaVA. These models use features from a single ViT layer as a visual prompt and feed them directly into the language model alongside the text tokens. However, for long visual sequences such as video, the self-attention mechanism of the language model can incur significant computational overhead. In addition, relying on single-layer ViT features makes it difficult for a large language model to perceive visual signals fully. This paper proposes an efficient multi-modal language model that minimizes computational cost while enabling the model to perceive visual signals as comprehensively as possible. Our method primarily includes: (1) using cross-attention for image-text interaction, similar to Flamingo; (2) utilizing hierarchical ViT features; and (3) introducing a Mixture of Experts (MoE) mechanism to enhance model effectiveness. Our model achieves competitive scores on public multi-modal benchmarks and performs well on tasks such as image captioning and video captioning.
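
The cost argument in the abstract can be illustrated with a rough back-of-the-envelope sketch. It is not taken from the paper: the function names and token counts below are hypothetical, and the count covers only query-key pairs in a single attention layer, ignoring heads, causal masking, MLP blocks, and how many layers actually contain cross-attention.

```python
def attention_pairs_concat(num_text: int, num_visual: int) -> int:
    """Query-key pairs when visual tokens are concatenated with text tokens
    and processed by self-attention (LLaVA-style input)."""
    total = num_text + num_visual
    return total * total


def attention_pairs_cross(num_text: int, num_visual: int) -> int:
    """Query-key pairs when text tokens self-attend among themselves and
    cross-attend to visual tokens (Flamingo-style interaction)."""
    return num_text * num_text + num_text * num_visual


if __name__ == "__main__":
    # Hypothetical sizes, chosen only for illustration.
    num_text = 256           # prompt length in tokens
    num_visual = 32 * 576    # 32 video frames, 576 patch tokens per frame

    concat = attention_pairs_concat(num_text, num_visual)
    cross = attention_pairs_cross(num_text, num_visual)
    print(f"concatenated self-attention pairs: {concat}")
    print(f"self-attention + cross-attention pairs: {cross}")
    print(f"ratio: {concat / cross:.1f}x")
```

Under these assumptions the concatenated design grows quadratically with the number of visual tokens, while the cross-attention design grows linearly in it, which is the motivation the abstract gives for choice (1).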
