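MobileVLM V2: Faster and Stronger Baseline for Vision Language Model

Abstract

We introduce MobileVLM V2, a family of vision language models (VLMs) that significantly improves upon MobileVLM. It demonstrates that novel architectural design, an improved training scheme tailored to mobile VLMs, and the curation of rich, high-quality datasets can substantially improve VLM performance. Specifically, MobileVLM V2 1.7B achieves on-par or better performance on standard VLM benchmarks compared with much larger VLMs at the 3B scale. Notably, our 3B model outperforms a wide variety of VLMs at the 7B+ scale. Our models will be released at https://github.com/Meituan-AutoML/MobileVLM .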

