

EVLM: An Efficient Vision-Language Model for Visual Understanding

July 19, 2024
作者: Kaibing Chen, Dong Shen, Hanwen Zhong, Huasong Zhong, Kui Xia, Di Xu, Wei Yuan, Yifei Hu, Bin Wen, Tianke Zhang, Changyi Liu, Dewen Fan, Huihui Xiao, Jiahong Wu, Fan Yang, Size Li, Di Zhang
cs.AI

Abstract

In the field of multi-modal language models, the majority of methods are built on an architecture similar to LLaVA. These models use a single-layer ViT feature as a visual prompt, feeding it directly into the language model alongside textual tokens. However, when dealing with long sequences of visual signals or inputs such as videos, the self-attention mechanism of the language model can incur significant computational overhead. Additionally, using single-layer ViT features makes it challenging for large language models to perceive visual signals fully. This paper proposes an efficient multi-modal language model that minimizes computational cost while enabling the model to perceive visual signals as comprehensively as possible. Our method primarily includes: (1) employing cross-attention for image-text interaction, similar to Flamingo; (2) utilizing hierarchical ViT features; and (3) introducing a Mixture of Experts (MoE) mechanism to enhance model effectiveness. Our model achieves competitive scores on public multi-modal benchmarks and performs well on tasks such as image captioning and video captioning.
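
To make the three ingredients named in the abstract concrete, here is a minimal PyTorch sketch of Flamingo-style cross-attention between text and visual tokens, hierarchical ViT features, and an MoE feed-forward block. This is not the authors' implementation: the module names, dimensions, the layer-concatenation reading of "hierarchical ViT features", and the top-1 routing are assumptions made purely for illustration.

```python
# Illustrative sketch only, not EVLM's actual code.
import torch
import torch.nn as nn


class MoEFeedForward(nn.Module):
    """Top-1 routed mixture of expert MLPs (an assumed, simplified MoE variant)."""
    def __init__(self, dim: int, num_experts: int = 4, hidden: int = 2048):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate = self.router(x).softmax(dim=-1)   # (B, T, E) routing weights
        top1 = gate.argmax(dim=-1)              # (B, T) chosen expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top1 == i
            if mask.any():
                out[mask] = expert(x[mask]) * gate[..., i][mask].unsqueeze(-1)
        return out


class CrossAttentionBlock(nn.Module):
    """Text tokens attend to visual tokens instead of being concatenated with them,
    so the language model's self-attention never runs over long visual sequences."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.moe = MoEFeedForward(dim)

    def forward(self, text: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        attended, _ = self.attn(query=text, key=visual, value=visual)
        x = self.norm1(text + attended)
        return self.norm2(x + self.moe(x))


def hierarchical_vit_features(layer_outputs: list[torch.Tensor]) -> torch.Tensor:
    """Concatenate hidden states from several ViT layers along the token axis.
    This is one possible reading of 'hierarchical ViT features'; the paper may
    aggregate the layers differently."""
    return torch.cat(layer_outputs, dim=1)


if __name__ == "__main__":
    B, T_text, T_vis, D = 2, 16, 64, 512
    text_tokens = torch.randn(B, T_text, D)
    # Pretend these are hidden states taken from three different ViT layers.
    vit_layers = [torch.randn(B, T_vis, D) for _ in range(3)]
    visual_tokens = hierarchical_vit_features(vit_layers)
    fused = CrossAttentionBlock(dim=D)(text_tokens, visual_tokens)
    print(fused.shape)  # torch.Size([2, 16, 512])
```

Because the visual tokens enter only as keys and values of the cross-attention, the cost of the language model's own self-attention stays tied to the text length rather than to the (potentially very long) visual sequence, which is the efficiency argument the abstract makes.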
