EVLM: An Efficient Vision-Language Model for Visual Understanding

Abstract

In the field of multi-modal language models, the majority of methods are built on an architecture similar to LLaVA. These models use single-layer ViT (Vision Transformer) features as a visual prompt and feed them directly into the language model alongside the text tokens. However, when the visual input is long, as with video, the self-attention mechanism of the language model can incur significant computational overhead. In addition, relying on single-layer ViT features makes it difficult for a large language model to perceive visual signals fully. This paper proposes an efficient multi-modal language model that minimizes computational cost while enabling the model to perceive visual signals as comprehensively as possible. Our method primarily includes: (1) employing cross-attention for image-text interaction, similar to Flamingo; (2) utilizing hierarchical ViT features; and (3) introducing the Mixture of Experts (MoE) mechanism to enhance model effectiveness. Our model achieves competitive scores on public multi-modal benchmarks and performs well in tasks such as image captioning and video captioning.
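To make the overhead argument concrete, the following is a rough back-of-the-envelope sketch, not the paper's own analysis. It counts only query-key pairs in a single attention layer and ignores heads, causal masking, feed-forward cost, and how many layers actually carry cross-attention. The function names and the token counts (16 frames, 256 visual tokens per frame, 128 text tokens) are hypothetical values chosen for illustration.

```python
def self_attention_pairs(num_visual: int, num_text: int) -> int:
    """Query-key pairs when visual and text tokens share one sequence (LLaVA-style)."""
    total = num_visual + num_text
    return total * total


def cross_attention_pairs(num_visual: int, num_text: int) -> int:
    """Query-key pairs when text attends to itself and, separately, to visual features."""
    return num_text * num_text + num_text * num_visual


# Hypothetical sizes (assumptions, not from the paper):
# 16 video frames x 256 visual tokens each, plus 128 text tokens.
num_visual = 16 * 256
num_text = 128

concat = self_attention_pairs(num_visual, num_text)
cross = cross_attention_pairs(num_visual, num_text)

print(f"concatenated self-attention pairs: {concat}")
print(f"text self-attention + cross-attention pairs: {cross}")
print(f"ratio: {concat / cross:.1f}x")
```

Under these assumptions the concatenated sequence scales quadratically with the number of visual tokens, while the cross-attention variant scales linearly with it, which is the intuition behind preferring cross-attention for long visual inputs.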
