EVLM: An Efficient Vision-Language Model for Visual Understanding
July 19, 2024
Authors: Kaibing Chen, Dong Shen, Hanwen Zhong, Huasong Zhong, Kui Xia, Di Xu, Wei Yuan, Yifei Hu, Bin Wen, Tianke Zhang, Changyi Liu, Dewen Fan, Huihui Xiao, Jiahong Wu, Fan Yang, Size Li, Di Zhang
cs.AI
Abstract
In the field of multi-modal language models, the majority of methods are
built on an architecture similar to LLaVA. These models use a single-layer ViT
feature as a visual prompt, directly feeding it into the language models
alongside textual tokens. However, when dealing with long sequences of visual
signals or inputs such as videos, the self-attention mechanism of language
models can lead to significant computational overhead. Additionally, using
single-layer ViT features makes it challenging for large language models to
perceive visual signals fully. This paper proposes an efficient multi-modal
language model to minimize computational costs while enabling the model to
perceive visual signals as comprehensively as possible. Our method primarily
includes: (1) employing cross-attention for image-text interaction, similar to
Flamingo; (2) utilizing hierarchical ViT features; and (3) introducing the
Mixture of Experts (MoE) mechanism to enhance model effectiveness. Our model
achieves competitive scores on public multi-modal benchmarks and performs well
in tasks such as image captioning and video captioning.
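For intuition, below is a minimal, hypothetical PyTorch sketch of the three ingredients the abstract names: Flamingo-style gated cross-attention between text and visual tokens, fusion of hierarchical (multi-layer) ViT features, and a Mixture-of-Experts feed-forward block. This is not the authors' implementation; the module names, hidden sizes, top-1 routing, and the concatenation-based layer fusion are all illustrative assumptions.

```python
# Hypothetical sketch of the ideas described in the abstract; all sizes,
# gating choices, and fusion details are assumptions, not EVLM's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoEFeedForward(nn.Module):
    """Token-wise top-1 routed mixture of expert MLPs (assumed gating scheme)."""

    def __init__(self, dim: int, num_experts: int = 4, hidden: int = 2048):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate = F.softmax(self.router(x), dim=-1)        # [B, T, num_experts]
        top_w, top_idx = gate.max(dim=-1)               # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e                         # tokens routed to expert e
            if mask.any():
                out[mask] = expert(x[mask]) * top_w[mask].unsqueeze(-1)
        return out


class GatedCrossAttentionBlock(nn.Module):
    """Text tokens (queries) attend to visual tokens (keys/values), Flamingo-style."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_gate = nn.Parameter(torch.zeros(1))   # tanh gate, starts closed
        self.moe = MoEFeedForward(dim)
        self.moe_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.cross_attn(self.norm(text), visual, visual)
        text = text + torch.tanh(self.attn_gate) * attn_out
        text = text + torch.tanh(self.moe_gate) * self.moe(self.norm(text))
        return text


def fuse_hierarchical_vit_features(layer_feats, proj: nn.Linear) -> torch.Tensor:
    """Concatenate features from several ViT layers along the channel axis and
    project them to the language-model width (one simple fusion assumption)."""
    return proj(torch.cat(layer_feats, dim=-1))         # [B, N, dim]


if __name__ == "__main__":
    B, N, T, vit_dim, dim = 2, 196, 16, 768, 512
    # Pretend these are hidden states taken from three different ViT layers.
    layer_feats = [torch.randn(B, N, vit_dim) for _ in range(3)]
    proj = nn.Linear(3 * vit_dim, dim)
    visual = fuse_hierarchical_vit_features(layer_feats, proj)

    text = torch.randn(B, T, dim)                       # text token states
    block = GatedCrossAttentionBlock(dim)
    print(block(text, visual).shape)                    # torch.Size([2, 16, 512])
```

Because the text only cross-attends to the visual tokens rather than concatenating them into the language model's input sequence, the self-attention cost stays quadratic in the text length alone, which is the efficiency argument the abstract makes for long visual inputs such as videos.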