
MobileVLM: A Fast, Reproducible and Strong Vision Language Assistant for Mobile Devices

December 28, 2023
作者: Xiangxiang Chu, Limeng Qiao, Xinyang Lin, Shuang Xu, Yang Yang, Yiming Hu, Fei Wei, Xinyu Zhang, Bo Zhang, Xiaolin Wei, Chunhua Shen
cs.AI

Abstract
We present MobileVLM, a competent multimodal vision language model (MMVLM) targeted to run on mobile devices. It is an amalgamation of a myriad of mobile-oriented architectural designs and techniques, comprising a set of language models at the scale of 1.4B and 2.7B parameters trained from scratch, a multimodal vision model pre-trained in the CLIP fashion, and cross-modality interaction via an efficient projector. We evaluate MobileVLM on several typical VLM benchmarks. Our models demonstrate on-par performance compared with a few much larger models. More importantly, we measure the inference speed on both a Qualcomm Snapdragon 888 CPU and an NVIDIA Jetson Orin GPU, and we obtain state-of-the-art performance of 21.5 tokens and 65.3 tokens per second, respectively. Our code will be made available at: https://github.com/Meituan-AutoML/MobileVLM.
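The abstract describes a three-part pipeline: a CLIP-style vision encoder produces patch embeddings, an efficient projector maps them into the language model's token space, and the LLM consumes the projected visual tokens alongside the text prompt. The paper does not specify the projector's internals here, so the following is only a minimal sketch of that data flow, with all dimensions and the single-linear-layer projector being illustrative assumptions rather than the paper's actual design:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not taken from the paper): vision patch
# embeddings of width 1024 are projected into a 2048-wide LLM token space.
VISION_DIM, LLM_DIM, NUM_PATCHES, NUM_TEXT_TOKENS = 1024, 2048, 144, 8

def project(patch_embeds, w, b):
    """Projector sketched as one linear layer: maps vision patch
    embeddings into the language model's embedding space."""
    return patch_embeds @ w + b

# Stand-ins for the CLIP-style encoder's output and the prompt embeddings.
patch_embeds = rng.standard_normal((NUM_PATCHES, VISION_DIM))
w = rng.standard_normal((VISION_DIM, LLM_DIM)) * 0.02
b = np.zeros(LLM_DIM)

visual_tokens = project(patch_embeds, w, b)
text_tokens = rng.standard_normal((NUM_TEXT_TOKENS, LLM_DIM))

# The LLM then attends over visual tokens prepended to the text tokens.
llm_input = np.concatenate([visual_tokens, text_tokens], axis=0)
print(llm_input.shape)  # (152, 2048)
```

The point of the sketch is only the interface: whatever the projector's real architecture, its job is to emit tokens shaped like LLM embeddings so the two modalities can share one sequence.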