Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference

Abstract

In recent years, multimodal large language models (MLLMs) have achieved remarkable success across a wide range of fields. However, as the foundation models for many downstream tasks, current MLLMs are built on the well-known Transformer network, whose quadratic computational complexity makes it less efficient. To improve the efficiency of these base models, we propose Cobra, an MLLM with linear computational complexity. Specifically, Cobra integrates the efficient Mamba language model into the visual modality. Moreover, we explore and study various modal fusion schemes to create an effective multimodal Mamba. Extensive experiments demonstrate that (1) Cobra achieves extremely competitive performance with current computationally efficient state-of-the-art methods, e.g., LLaVA-Phi, TinyLLaVA, and MobileVLM v2, and is faster thanks to its linear sequential modeling; (2) interestingly, results on challenging closed-set prediction benchmarks show that Cobra performs well at overcoming visual illusions and judging spatial relationships; and (3) notably, Cobra achieves performance comparable to LLaVA with only about 43% of the number of parameters. We will open-source all of Cobra's code and hope that the proposed method can facilitate future research on complexity problems in MLLMs. Our project page is available at https://sites.google.com/view/cobravlm.
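To make the quadratic-versus-linear contrast in the abstract concrete, the following minimal sketch counts how the number of sequence-mixing operations grows with sequence length. It is an illustration only and is not taken from the Cobra code: the function names are hypothetical, and the counts ignore the hidden dimension, constant factors, and hardware effects.

```python
def attention_pairwise_ops(seq_len: int) -> int:
    """Token-pair interactions in full self-attention: grows as L^2."""
    return seq_len * seq_len


def linear_scan_ops(seq_len: int) -> int:
    """State updates in a recurrent (state-space style) scan: one per token."""
    return seq_len


def growth_when_doubling(cost_fn, seq_len: int) -> float:
    """Factor by which the cost grows when the sequence length doubles."""
    return cost_fn(2 * seq_len) / cost_fn(seq_len)


if __name__ == "__main__":
    # Print the two idealized costs side by side for a few sequence lengths.
    for length in (512, 1024, 2048, 4096):
        print(length, attention_pairwise_ops(length), linear_scan_ops(length))
```

Under these simplifying assumptions, doubling the sequence length quadruples the attention count but only doubles the scan count. This is the scaling behavior the abstract refers to when it attributes Cobra's speed to linear sequential modeling.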
