GIMMICK -- Globally Inclusive Multimodal Multitask Cultural Knowledge Benchmarking
February 19, 2025
Authors: Florian Schneider, Carolin Holtermann, Chris Biemann, Anne Lauscher
cs.AI
Abstract
Large Vision-Language Models (LVLMs) have recently gained attention due to
their distinctive performance and broad applicability. While it has been
previously shown that their efficacy in usage scenarios involving non-Western
contexts falls short, existing studies are limited in scope, covering just a
narrow range of cultures, focusing exclusively on a small number of cultural
aspects, or evaluating a limited selection of models on a single task only.
Towards globally inclusive LVLM research, we introduce GIMMICK, an extensive
multimodal benchmark designed to assess a broad spectrum of cultural knowledge
across 144 countries representing six global macro-regions. GIMMICK comprises
six tasks built upon three new datasets that span 728 unique cultural events or
facets on which we evaluated 20 LVLMs and 11 LLMs, including five proprietary
and 26 open-weight models of all sizes. We systematically examine (1) regional
cultural biases, (2) the influence of model size, (3) input modalities, and (4)
external cues. Our analyses reveal strong biases toward Western cultures across
models and tasks and highlight strong correlations between model size and
performance, as well as the effectiveness of multimodal input and external
geographic cues. We further find that models have more knowledge of tangible
than intangible aspects (e.g., food vs. rituals) and that they excel in
recognizing broad cultural origins but struggle with a more nuanced
understanding.