MobileVLM: A Fast, Reproducible and Strong Vision Language Assistant for Mobile Devices
December 28, 2023
作者: Xiangxiang Chu, Limeng Qiao, Xinyang Lin, Shuang Xu, Yang Yang, Yiming Hu, Fei Wei, Xinyu Zhang, Bo Zhang, Xiaolin Wei, Chunhua Shen
cs.AI
Abstract
We present MobileVLM, a competent multimodal vision language model (MMVLM) targeted to run on mobile devices. It combines a variety of mobile-oriented architectural designs and techniques: a set of language models at the 1.4B and 2.7B parameter scales trained from scratch, a multimodal vision model pre-trained in the CLIP fashion, and cross-modality interaction via an efficient projector. We evaluate MobileVLM on several typical VLM benchmarks, where our models perform on par with a few much larger models. More importantly, we measure inference speed on both a Qualcomm Snapdragon 888 CPU and an NVIDIA Jetson Orin GPU, obtaining state-of-the-art speeds of 21.5 and 65.3 tokens per second, respectively. Our code will be made available at: https://github.com/Meituan-AutoML/MobileVLM.
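
The abstract describes a three-part architecture: a CLIP-style vision encoder, an efficient projector for cross-modality interaction, and a 1.4B/2.7B language model trained from scratch. Below is a minimal sketch of how these components might compose; the module and parameter names (EfficientProjector, MobileVLMSketch, vision_dim, llm_dim) are illustrative assumptions, not the released MobileVLM implementation, and the single linear layer merely stands in for whatever projector the paper actually uses.

```python
# Minimal sketch (not the authors' code) of the vision-encoder ->
# projector -> LLM composition described in the abstract. All names
# and shapes here are illustrative assumptions.
import torch
import torch.nn as nn


class EfficientProjector(nn.Module):
    """Maps vision-encoder features into the LLM embedding space.

    The abstract only says the projector is "efficient"; a single
    linear layer serves as a placeholder for that component here.
    """

    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        return self.proj(visual_feats)


class MobileVLMSketch(nn.Module):
    """Composes the three components named in the abstract."""

    def __init__(self, vision_encoder: nn.Module, projector: nn.Module,
                 llm: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder  # CLIP-style ViT, pre-trained
        self.projector = projector            # cross-modality bridge
        self.llm = llm                        # 1.4B / 2.7B language model

    def forward(self, image: torch.Tensor,
                text_embeds: torch.Tensor) -> torch.Tensor:
        visual_feats = self.vision_encoder(image)     # (B, N, vision_dim)
        visual_embeds = self.projector(visual_feats)  # (B, N, llm_dim)
        # Prepend projected visual tokens to the text embeddings and let
        # the language model decode over the joint sequence.
        inputs = torch.cat([visual_embeds, text_embeds], dim=1)
        return self.llm(inputs)
```

A practical projector for mobile hardware would likely also reduce the number of visual tokens, since decoding cost grows with sequence length; the linear placeholder above preserves the token count for simplicity.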
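
The abstract also reports throughput of 21.5 and 65.3 tokens per second on a Snapdragon 888 CPU and a Jetson Orin GPU. One simple way such a figure can be measured is sketched below; generate is a hypothetical callable standing in for the model's decoding loop, not an API from the MobileVLM codebase.

```python
# Hedged sketch of a tokens-per-second measurement. `generate` is a
# hypothetical stand-in for a model's decoding entry point.
import time
from typing import Callable, List


def tokens_per_second(generate: Callable[..., List[int]],
                      prompt: str,
                      max_new_tokens: int = 128,
                      warmup_runs: int = 2,
                      timed_runs: int = 5) -> float:
    """Average decoding throughput over several timed runs."""
    # Warm-up runs amortize one-off costs (weight loading, JIT, caches).
    for _ in range(warmup_runs):
        generate(prompt, max_new_tokens=max_new_tokens)

    total_tokens, total_time = 0, 0.0
    for _ in range(timed_runs):
        start = time.perf_counter()
        tokens = generate(prompt, max_new_tokens=max_new_tokens)
        total_time += time.perf_counter() - start
        total_tokens += len(tokens)
    return total_tokens / total_time
```

Note that this conflates prompt prefill with token-by-token decoding; benchmarks that quote a per-second decoding speed usually time the two phases separately.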