MobileVLM V2: Faster and Stronger Baseline for Vision Language Model
February 6, 2024
Authors: Xiangxiang Chu, Limeng Qiao, Xinyu Zhang, Shuang Xu, Fei Wei, Yang Yang, Xiaofei Sun, Yiming Hu, Xinyang Lin, Bo Zhang, Chunhua Shen
cs.AI
Abstract
We introduce MobileVLM V2, a family of significantly improved vision language
models built upon MobileVLM, which proves that a delicate orchestration of
novel architectural design, an improved training scheme tailored for mobile
VLMs, and rich, high-quality dataset curation can substantially benefit VLMs' performance.
Specifically, MobileVLM V2 1.7B achieves better or on-par performance on
standard VLM benchmarks compared with much larger VLMs at the 3B scale.
Notably, our 3B model outperforms a large variety of VLMs at the 7B+ scale. Our
models will be released at https://github.com/Meituan-AutoML/MobileVLM .