Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model

Abstract

We introduce Xmodel-VLM, a multimodal vision language model designed for efficient deployment on consumer GPU servers. Our work addresses a pivotal industry issue: the prohibitive service costs that hinder the broad adoption of large-scale multimodal systems. We trained a 1B-scale language model from scratch and used the LLaVA paradigm for modality alignment. The result, which we call Xmodel-VLM, is a lightweight yet powerful multimodal vision language model. Extensive testing on numerous classic multimodal benchmarks shows that, despite its smaller size and faster execution, Xmodel-VLM delivers performance comparable to that of larger models. Our model checkpoints and code are publicly available on GitHub at https://github.com/XiaoduoAILab/XmodelVLM.
