MVL-SIB: A Massively Multilingual Vision-Language Benchmark for Cross-Modal Topical Matching
February 18, 2025
Authors: Fabian David Schmidt, Florian Schneider, Chris Biemann, Goran Glavaš
cs.AI
Abstract
Existing multilingual vision-language (VL) benchmarks often only cover a
handful of languages. Consequently, evaluations of large vision-language models
(LVLMs) predominantly target high-resource languages, underscoring the need for
evaluation data for low-resource languages. To address this limitation, we
introduce MVL-SIB, a massively multilingual vision-language benchmark that
evaluates both cross-modal and text-only topical matching across 205 languages,
over 100 more than the most multilingual existing VL benchmarks encompass.
We then benchmark a range of open-weight LVLMs together with GPT-4o(-mini)
on MVL-SIB. Our results reveal that LVLMs struggle with cross-modal topical
matching in lower-resource languages, performing no better than chance on
languages like N'Ko. Our analysis further reveals that VL support in LVLMs
declines disproportionately relative to textual support for lower-resource
languages, as evidenced by a comparison of cross-modal and text-only topical
matching performance. We also observe that open-weight LVLMs do not benefit
from representing a topic with more than one image, suggesting that these
models are not yet fully effective at handling multi-image tasks. By
correlating performance on MVL-SIB with other multilingual VL benchmarks, we
highlight that MVL-SIB serves as a comprehensive probe of multilingual VL
understanding in LVLMs.
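The abstract reports that some models perform no better than chance on certain languages. One standard way to check a claim like this for a single model and language is a one-sided exact binomial test. It compares the number of correct answers against the rate expected from random guessing. The sketch below is illustrative only. The abstract does not say whether the authors used a significance test. The chance rate also depends on how many candidate options each MVL-SIB item offers, so it is treated here as an assumption. The function name `p_value_above_chance` and the example counts are hypothetical.

```python
from math import comb


def p_value_above_chance(correct: int, total: int, chance: float) -> float:
    """One-sided exact binomial test (hypothetical helper, not from the paper).

    Returns P(X >= correct) for X ~ Binomial(total, chance): the probability of
    scoring at least `correct` out of `total` items by random guessing alone.
    """
    if not 0 <= correct <= total:
        raise ValueError("correct must lie in [0, total]")
    if not 0.0 < chance < 1.0:
        raise ValueError("chance must lie strictly between 0 and 1")
    # Sum the binomial probability mass over every outcome at least as good as observed.
    return sum(
        comb(total, k) * chance**k * (1.0 - chance) ** (total - k)
        for k in range(correct, total + 1)
    )


if __name__ == "__main__":
    # Assumption: 4 candidate options per item, so guessing succeeds 25% of the time.
    chance = 1 / 4
    # Hypothetical counts for illustration only; these are not results from the paper.
    print(p_value_above_chance(27, 100, chance))  # well above 0.05: consistent with guessing
    print(p_value_above_chance(60, 100, chance))  # far below 0.05: clearly above chance
```

This pure-Python version is adequate for a few hundred items per language. For much larger counts, the binomial coefficients can overflow a float. In that case, `scipy.stats.binomtest(k, n, p, alternative="greater")` is the usual library route.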