MVL-SIB: A Massively Multilingual Vision-Language Benchmark for Cross-Modal Topical Matching

Abstract

Existing multilingual vision-language (VL) benchmarks often cover only a handful of languages. Consequently, evaluations of large vision-language models (LVLMs) predominantly target high-resource languages, underscoring the need for evaluation data for low-resource languages. To address this limitation, we introduce MVL-SIB, a massively multilingual vision-language benchmark that evaluates both cross-modal and text-only topical matching across 205 languages, over 100 more than the most multilingual existing VL benchmarks cover. We then benchmark a range of open-weight LVLMs together with GPT-4o(-mini) on MVL-SIB. Our results reveal that LVLMs struggle with cross-modal topical matching in lower-resource languages, performing no better than chance on languages like N'Koo. A comparison of cross-modal and text-only topical matching performance further shows that VL support in LVLMs declines disproportionately relative to textual support for lower-resource languages. We also observe that open-weight LVLMs do not benefit from representing a topic with more than one image, suggesting that these models are not yet fully effective at handling multi-image tasks. By correlating performance on MVL-SIB with other multilingual VL benchmarks, we highlight that MVL-SIB serves as a comprehensive probe of multilingual VL understanding in LVLMs.
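As an illustration of the correlation analysis mentioned above, the following is a minimal sketch of how per-model scores on two benchmarks could be correlated. It assumes Pearson correlation, which the abstract does not specify. The function name `pearson` and both score lists are hypothetical placeholders, not results from this work.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and of squared deviations around each mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# Placeholder accuracies for four hypothetical models (NOT results from the paper).
mvl_sib_scores = [0.30, 0.45, 0.60, 0.75]
other_benchmark_scores = [0.20, 0.35, 0.55, 0.65]

r = pearson(mvl_sib_scores, other_benchmark_scores)
print(f"Pearson r = {r:.3f}")
```

A coefficient close to 1 would indicate that models ranking well on one benchmark also tend to rank well on the other. A rank-based measure such as Spearman correlation would be a reasonable alternative when only the ordering of models matters.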