MobileVLM: A Fast, Reproducible and Strong Vision Language Assistant for Mobile Devices

Abstract

We present MobileVLM, a competent multimodal vision language model (MMVLM) designed to run on mobile devices. It combines a range of mobile-oriented architectural designs and techniques: a set of language models at the scale of 1.4B and 2.7B parameters, trained from scratch; a multimodal vision model pre-trained in the CLIP fashion; and cross-modality interaction via an efficient projector. We evaluate MobileVLM on several typical VLM benchmarks, where our models perform on par with a few much larger models. More importantly, we measure inference speed on both a Qualcomm Snapdragon 888 CPU and an NVIDIA Jetson Orin GPU, obtaining state-of-the-art performance of 21.5 and 65.3 tokens per second, respectively. Our code will be made available at: https://github.com/Meituan-AutoML/MobileVLM.

