EVLM: An Efficient Vision-Language Model for Visual Understanding
July 19, 2024
Authors: Kaibing Chen, Dong Shen, Hanwen Zhong, Huasong Zhong, Kui Xia, Di Xu, Wei Yuan, Yifei Hu, Bin Wen, Tianke Zhang, Changyi Liu, Dewen Fan, Huihui Xiao, Jiahong Wu, Fan Yang, Size Li, Di Zhang
cs.AI
Abstract
In the field of multi-modal language models, the majority of methods are
built on an architecture similar to LLaVA. These models use a single-layer ViT
feature as a visual prompt, directly feeding it into the language models
alongside textual tokens. However, when dealing with long sequences of visual
signals or inputs such as videos, the self-attention mechanism of language
models can lead to significant computational overhead. Additionally, using
single-layer ViT features makes it challenging for large language models to
perceive visual signals fully. This paper proposes an efficient multi-modal
language model to minimize computational costs while enabling the model to
perceive visual signals as comprehensively as possible. Our method primarily
includes: (1) employing cross-attention for image-text interaction, similar to
Flamingo; (2) utilizing hierarchical ViT features; and (3) introducing the
Mixture of Experts (MoE) mechanism to enhance model effectiveness. Our model
achieves competitive scores on public multi-modal benchmarks and performs well
in tasks such as image captioning and video captioning.
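For intuition, below is a minimal, hypothetical PyTorch sketch of the three ingredients the abstract names: Flamingo-style gated cross-attention between text and visual tokens, fusion of hierarchical (multi-layer) ViT features, and a Mixture-of-Experts feed-forward block. This is not the authors' implementation; the module names, hidden sizes, top-1 routing, and the concatenation-based layer fusion are all illustrative assumptions.

```python
# Hypothetical sketch of the ideas described in the abstract; all sizes,
# gating choices, and fusion details are assumptions, not EVLM's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoEFeedForward(nn.Module):
    """Token-wise top-1 routed mixture of expert MLPs (assumed gating scheme)."""

    def __init__(self, dim: int, num_experts: int = 4, hidden: int = 2048):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate = F.softmax(self.router(x), dim=-1)        # [B, T, num_experts]
        top_w, top_idx = gate.max(dim=-1)               # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e                         # tokens routed to expert e
            if mask.any():
                out[mask] = expert(x[mask]) * top_w[mask].unsqueeze(-1)
        return out


class GatedCrossAttentionBlock(nn.Module):
    """Text tokens (queries) attend to visual tokens (keys/values), Flamingo-style."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_gate = nn.Parameter(torch.zeros(1))   # tanh gate, starts closed
        self.moe = MoEFeedForward(dim)
        self.moe_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.cross_attn(self.norm(text), visual, visual)
        text = text + torch.tanh(self.attn_gate) * attn_out
        text = text + torch.tanh(self.moe_gate) * self.moe(self.norm(text))
        return text


def fuse_hierarchical_vit_features(layer_feats, proj: nn.Linear) -> torch.Tensor:
    """Concatenate features from several ViT layers along the channel axis and
    project them to the language-model width (one simple fusion assumption)."""
    return proj(torch.cat(layer_feats, dim=-1))         # [B, N, dim]


if __name__ == "__main__":
    B, N, T, vit_dim, dim = 2, 196, 16, 768, 512
    # Pretend these are hidden states taken from three different ViT layers.
    layer_feats = [torch.randn(B, N, vit_dim) for _ in range(3)]
    proj = nn.Linear(3 * vit_dim, dim)
    visual = fuse_hierarchical_vit_features(layer_feats, proj)

    text = torch.randn(B, T, dim)                       # text token states
    block = GatedCrossAttentionBlock(dim)
    print(block(text, visual).shape)                    # torch.Size([2, 16, 512])
```

Because the text only cross-attends to the visual tokens rather than concatenating them into the language model's input sequence, the self-attention cost stays quadratic in the text length alone, which is the efficiency argument the abstract makes for long visual inputs such as videos.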