Behind Maya: Building a Multilingual Vision Language Model

May 13, 2025
作者: Nahid Alam, Karthik Reddy Kanjula, Surya Guthikonda, Timothy Chung, Bala Krishna S Vegesna, Abhipsha Das, Anthony Susevski, Ryan Sze-Yin Chan, S M Iftekhar Uddin, Shayekh Bin Islam, Roshan Santhosh, Snegha A, Drishti Sharma, Chen Liu, Isha Chaturvedi, Genta Indra Winata, Ashvanth. S, Snehanshu Mukherjee, Alham Fikri Aji
cs.AI

Abstract

In recent times, we have seen rapid development of large Vision-Language Models (VLMs). They have shown impressive results on academic benchmarks, primarily in widely spoken languages, but underperform on low-resource languages and in varied cultural contexts. To address these limitations, we introduce Maya, an open-source multilingual VLM. Our contributions are: 1) a multilingual image-text pretraining dataset in eight languages, based on the LLaVA pretraining dataset; and 2) a multilingual image-text model supporting these languages, enhancing cultural and linguistic comprehension in vision-language tasks. Code is available at https://github.com/nahidalam/maya.
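
The abstract states only that the pretraining dataset extends the LLaVA pretraining data to eight languages; it does not describe the construction pipeline. As a rough illustration, the sketch below shows one plausible way to expand an English LLaVA-style record into multiple languages via machine translation. The `translate` helper, the `LANGUAGES` list, and the record schema are assumptions made for illustration, not the authors' documented method.

```python
# A minimal sketch (assumed pipeline, not the authors'): expanding English
# LLaVA-style image-text records into several languages with machine
# translation. `translate` is a hypothetical stand-in for any MT backend,
# and the language list is an assumption.

LANGUAGES = ["en", "zh", "fr", "es", "ru", "hi", "ja", "ar"]  # assumed set of eight

def translate(text: str, target_lang: str) -> str:
    """Hypothetical MT call; plug in a real translation model or API."""
    raise NotImplementedError

def expand_record(record: dict) -> list[dict]:
    """Turn one English record into one translated copy per target language.

    Assumes the LLaVA pretraining JSON schema: each record carries an
    "image" path and a list of {"from", "value"} conversation turns.
    """
    expanded = []
    for lang in LANGUAGES:
        turns = [
            {**turn,
             "value": turn["value"] if lang == "en"
                      else translate(turn["value"], lang)}
            for turn in record["conversations"]
        ]
        expanded.append({"image": record["image"],
                         "language": lang,
                         "conversations": turns})
    return expanded
```

In practice, a pipeline like this would also need translation-quality filtering, especially for low-resource languages, but such details are beyond what the abstract specifies.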

