Behind Maya: Building a Multilingual Vision Language Model
May 13, 2025
作者: Nahid Alam, Karthik Reddy Kanjula, Surya Guthikonda, Timothy Chung, Bala Krishna S Vegesna, Abhipsha Das, Anthony Susevski, Ryan Sze-Yin Chan, S M Iftekhar Uddin, Shayekh Bin Islam, Roshan Santhosh, Snegha A, Drishti Sharma, Chen Liu, Isha Chaturvedi, Genta Indra Winata, Ashvanth. S, Snehanshu Mukherjee, Alham Fikri Aji
cs.AI
Abstract
In recent times, we have seen a rapid development of large Vision-Language
Models (VLMs). They have shown impressive results on academic benchmarks,
primarily in widely spoken languages but lack performance on low-resource
languages and varied cultural contexts. To address these limitations, we
introduce Maya, an open-source Multilingual VLM. Our contributions are: 1) a
multilingual image-text pretraining dataset in eight languages, based on the
LLaVA pretraining dataset; and 2) a multilingual image-text model supporting
these languages, enhancing cultural and linguistic comprehension in
vision-language tasks. Code available at https://github.com/nahidalam/maya.
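
The first contribution, a multilingual image-text pretraining dataset derived from the LLaVA pretraining data, can be pictured as reusing each original image while rendering its caption in every target language. The sketch below is illustrative only and is not the authors' pipeline; the specific language codes and the `translate_caption`/`expand_record` helpers are hypothetical placeholders, and a real pipeline would call an actual machine-translation model in place of the stub.

```python
# Illustrative sketch (not the Maya pipeline): expanding an English
# image-caption record, LLaVA-style, into one record per target language.

import json
from typing import Dict, List

# Hypothetical set of eight language codes; the paper's exact language list
# is not given in the abstract.
LANGUAGES = ["en", "zh", "fr", "es", "ru", "hi", "ja", "ar"]


def translate_caption(caption: str, target_lang: str) -> str:
    """Placeholder translation; swap in a real MT model or service here."""
    if target_lang == "en":
        return caption
    # Stand-in for the translated text so the sketch stays runnable.
    return f"[{target_lang}] {caption}"


def expand_record(record: Dict) -> List[Dict]:
    """Turn one English image-caption record into one record per language."""
    return [
        {
            "image": record["image"],          # same image for every language
            "language": lang,
            "caption": translate_caption(record["caption"], lang),
        }
        for lang in LANGUAGES
    ]


if __name__ == "__main__":
    sample = {"image": "000000123.jpg", "caption": "A dog playing in the park."}
    print(json.dumps(expand_record(sample), ensure_ascii=False, indent=2))
```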