EVLM: An Efficient Vision-Language Model for Visual Understanding
July 19, 2024
Authors: Kaibing Chen, Dong Shen, Hanwen Zhong, Huasong Zhong, Kui Xia, Di Xu, Wei Yuan, Yifei Hu, Bin Wen, Tianke Zhang, Changyi Liu, Dewen Fan, Huihui Xiao, Jiahong Wu, Fan Yang, Size Li, Di Zhang
cs.AI
Abstract
In the field of multi-modal language models, the majority of methods are built on an architecture similar to LLaVA. These models use a single-layer ViT feature as a visual prompt, feeding it directly into the language model alongside textual tokens. However, when dealing with long sequences of visual signals or inputs such as videos, the self-attention mechanism of language models can lead to significant computational overhead. Additionally, using single-layer ViT features makes it challenging for large language models to fully perceive visual signals. This paper proposes an efficient multi-modal language model that minimizes computational cost while enabling the model to perceive visual signals as comprehensively as possible. Our method primarily includes: (1) employing a cross-attention mechanism for image-text interaction, similar to Flamingo; (2) utilizing hierarchical ViT features; and (3) introducing the Mixture of Experts (MoE) mechanism to enhance model effectiveness. Our model achieves competitive scores on public multi-modal benchmarks and performs well in tasks such as image captioning and video captioning.
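
To make the described architecture concrete, below is a minimal PyTorch sketch of a Flamingo-style gated cross-attention block that consumes hierarchical ViT features and routes its feed-forward network through a simple Mixture of Experts. This is an illustration of the general technique under stated assumptions, not the authors' implementation: the module names (GatedCrossAttentionBlock, MoEFeedForward), the hidden dimensions, the learned weighting over ViT layers, the zero-initialized tanh gates, and the top-1 expert routing are all assumptions not specified in the abstract.

```python
# Minimal, illustrative sketch (not the authors' code): text tokens attend to a
# weighted mix of features from several ViT layers via gated cross-attention,
# followed by a simple top-1 Mixture-of-Experts feed-forward.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoEFeedForward(nn.Module):
    """Top-1 routed mixture of expert MLPs (illustrative; no load balancing)."""

    def __init__(self, dim: int, num_experts: int = 4, hidden: int = 2048):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate = F.softmax(self.router(x), dim=-1)   # (B, T, num_experts)
        top_w, top_idx = gate.max(dim=-1)          # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():
                out[mask] = expert(x[mask]) * top_w[mask].unsqueeze(-1)
        return out


class GatedCrossAttentionBlock(nn.Module):
    """Text tokens (queries) attend to visual tokens (keys/values), Flamingo-style."""

    def __init__(self, dim: int = 768, num_heads: int = 8, num_vit_layers: int = 4):
        super().__init__()
        # Learned weights that mix features taken from several ViT layers.
        self.layer_weights = nn.Parameter(torch.zeros(num_vit_layers))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = MoEFeedForward(dim)
        # Zero-initialized tanh gates: the block starts as an identity mapping.
        self.attn_gate = nn.Parameter(torch.zeros(1))
        self.ffn_gate = nn.Parameter(torch.zeros(1))
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, text: torch.Tensor, vit_features: torch.Tensor) -> torch.Tensor:
        # vit_features: (num_vit_layers, B, N_visual, dim) -> weighted sum over layers.
        w = F.softmax(self.layer_weights, dim=0).view(-1, 1, 1, 1)
        visual = (w * vit_features).sum(dim=0)
        attn_out, _ = self.cross_attn(
            self.norm_q(text), self.norm_kv(visual), self.norm_kv(visual)
        )
        text = text + torch.tanh(self.attn_gate) * attn_out
        text = text + torch.tanh(self.ffn_gate) * self.ffn(text)
        return text


# Usage: features from 4 ViT layers, batch of 2, 196 visual tokens, 32 text tokens.
block = GatedCrossAttentionBlock(dim=768)
vit_feats = torch.randn(4, 2, 196, 768)
text_tokens = torch.randn(2, 32, 768)
print(block(text_tokens, vit_feats).shape)  # torch.Size([2, 32, 768])
```

Because the visual tokens enter only as keys and values of the cross-attention, the language model's self-attention cost does not grow with the number of visual tokens, which is the efficiency argument the abstract makes for long visual inputs such as videos. The zero-initialized tanh gates follow Flamingo's practice of letting the pretrained language model start unchanged and gradually admit visual information during training.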