Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model
May 15, 2024
Authors: Wanting Xu, Yang Liu, Langping He, Xucheng Huang, Ling Jiang
cs.AI
Abstract
We introduce Xmodel-VLM, a cutting-edge multimodal vision language model. It
is designed for efficient deployment on consumer GPU servers. Our work directly
addresses a pivotal industry issue: the prohibitive service costs that hinder
the broad adoption of large-scale multimodal systems. Through
rigorous training, we have developed a 1B-scale language model from the ground
up, employing the LLaVA paradigm for modal alignment. The result, which we call
Xmodel-VLM, is a lightweight yet powerful multimodal vision language model.
Extensive testing across numerous classic multimodal benchmarks has revealed
that despite its smaller size and faster execution, Xmodel-VLM delivers
performance comparable to that of larger models. Our model checkpoints and code
are publicly available on GitHub at https://github.com/XiaoduoAILab/XmodelVLM.
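The alignment recipe described above follows the LLaVA paradigm: features from a pretrained vision encoder are projected into the language model's embedding space and prepended to the text tokens. The sketch below illustrates only that general idea; the class names, the two-layer MLP projector, and the dimensions (CLIP-ViT-like patch features mapped into a roughly 1B-parameter LLM) are illustrative assumptions, not the actual code in the XmodelVLM repository.

# Minimal sketch of LLaVA-style modality alignment (illustrative assumptions,
# not the Xmodel-VLM implementation; see the GitHub repository for the real code).
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Maps vision-encoder patch features into the LLM embedding space."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        # A small MLP is a common choice of projector in LLaVA-style models.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim)
        return self.proj(patch_features)  # (batch, num_patches, llm_dim)

def build_multimodal_inputs(image_tokens: torch.Tensor,
                            text_embeddings: torch.Tensor) -> torch.Tensor:
    """Prepend projected image tokens to the text embeddings, LLaVA-style."""
    return torch.cat([image_tokens, text_embeddings], dim=1)

# Example with assumed dimensions: 576 patch features of size 1024 from a
# CLIP-like encoder, projected to a hypothetical 2048-dim LLM hidden size.
projector = VisionProjector(vision_dim=1024, llm_dim=2048)
image_tokens = projector(torch.randn(1, 576, 1024))
text_embeddings = torch.randn(1, 32, 2048)
inputs_embeds = build_multimodal_inputs(image_tokens, text_embeddings)

In LLaVA-style training, the projector is typically the only component updated during the initial alignment stage, with the vision encoder kept frozen; whether the language model itself is frozen or fine-tuned depends on the training stage and the specific recipe the authors chose.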