
Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model

May 15, 2024
作者: Wanting Xu, Yang Liu, Langping He, Xucheng Huang, Ling Jiang
cs.AI

Abstract

We introduce Xmodel-VLM, a cutting-edge multimodal vision language model designed for efficient deployment on consumer GPU servers. Our work directly confronts a pivotal industry issue: the prohibitive service costs that hinder the broad adoption of large-scale multimodal systems. Through rigorous training, we developed a 1B-scale language model from the ground up, employing the LLaVA paradigm for modal alignment. The result, which we call Xmodel-VLM, is a lightweight yet powerful multimodal vision language model. Extensive testing across numerous classic multimodal benchmarks shows that, despite its smaller size and faster execution, Xmodel-VLM delivers performance comparable to that of larger models. Our model checkpoints and code are publicly available on GitHub at https://github.com/XiaoduoAILab/XmodelVLM.
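To make the "LLaVA paradigm for modal alignment" mentioned above concrete, here is a minimal PyTorch sketch of the general idea: vision-encoder patch features are projected into the language model's embedding space and prepended to the text token embeddings. All names, dimensions, and the two-layer MLP projector are illustrative assumptions for this sketch, not the released Xmodel-VLM implementation; see the GitHub repository for the actual code.

# Minimal sketch of LLaVA-style modality alignment (illustrative only; not the
# released Xmodel-VLM code). Dimensions and the projector design are assumptions.
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Maps vision-encoder patch features into the LLM embedding space."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        # A simple two-layer MLP projector; the paper's projector may differ.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        return self.proj(patch_feats)

# Assumed dimensions: vision encoder feature size and a 1B-scale LLM hidden size.
vision_dim, llm_dim = 1024, 2048
batch, num_patches, num_text_tokens = 1, 576, 32

patch_feats = torch.randn(batch, num_patches, vision_dim)   # stand-in for vision encoder output
text_embeds = torch.randn(batch, num_text_tokens, llm_dim)  # stand-in for text token embeddings

projector = VisionProjector(vision_dim, llm_dim)
visual_tokens = projector(patch_feats)                       # shape: (1, 576, 2048)

# Visual tokens are concatenated with text embeddings and fed to the language model.
llm_inputs = torch.cat([visual_tokens, text_embeds], dim=1)
print(llm_inputs.shape)  # torch.Size([1, 608, 2048])

In this scheme only the projector (and, during fine-tuning, the language model) needs to learn the alignment, which is what keeps the approach lightweight for a 1B-scale backbone.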
