MobileVLM V2: Faster and Stronger Baseline for Vision Language Model
February 6, 2024
Authors: Xiangxiang Chu, Limeng Qiao, Xinyu Zhang, Shuang Xu, Fei Wei, Yang Yang, Xiaofei Sun, Yiming Hu, Xinyang Lin, Bo Zhang, Chunhua Shen
cs.AI
Abstract
We introduce MobileVLM V2, a family of significantly improved vision language models built upon MobileVLM, demonstrating that a careful orchestration of novel architectural design, an improved training scheme tailored for mobile VLMs, and rich, high-quality dataset curation can substantially boost VLM performance. Specifically, MobileVLM V2 1.7B achieves better or on-par performance on standard VLM benchmarks compared with much larger VLMs at the 3B scale. Notably, our 3B model outperforms a wide variety of VLMs at the 7B+ scale. Our models will be released at https://github.com/Meituan-AutoML/MobileVLM.