MobileVLM: A Fast, Reproducible and Strong Vision Language Assistant for Mobile Devices

Abstract

We present MobileVLM, a competent multimodal vision language model (MMVLM) designed to run on mobile devices. It combines a range of mobile-oriented architectural designs and techniques: a set of language models at the scale of 1.4B and 2.7B parameters, trained from scratch; a multimodal vision model pre-trained in the CLIP fashion; and cross-modality interaction via an efficient projector. We evaluate MobileVLM on several typical VLM benchmarks. Our models perform on par with a few much larger models. More importantly, we measure the inference speed on both a Qualcomm Snapdragon 888 CPU and an NVIDIA Jetson Orin GPU, and we obtain state-of-the-art performance of 21.5 and 65.3 tokens per second, respectively. Our code will be made available at: https://github.com/Meituan-AutoML/MobileVLM.
