GIMMICK -- Globally Inclusive Multimodal Multitask Cultural Knowledge Benchmarking

Abstract

Large Vision-Language Models (LVLMs) have recently gained attention due to their distinctive performance and broad applicability. While it has been previously shown that their efficacy falls short in usage scenarios involving non-Western contexts, existing studies are limited in scope: they cover just a narrow range of cultures, focus exclusively on a small number of cultural aspects, or evaluate a limited selection of models on a single task only. Towards globally inclusive LVLM research, we introduce GIMMICK, an extensive multimodal benchmark designed to assess a broad spectrum of cultural knowledge across 144 countries representing six global macro-regions. GIMMICK comprises six tasks built upon three new datasets that span 728 unique cultural events or facets. On these tasks we evaluated 20 LVLMs and 11 LLMs, including five proprietary and 26 open-weight models of all sizes. We systematically examine (1) regional cultural biases, (2) the influence of model size, (3) input modalities, and (4) external cues. Our analyses reveal strong biases toward Western cultures across models and tasks and highlight strong correlations between model size and performance, as well as the effectiveness of multimodal input and external geographic cues. We further find that models have more knowledge of tangible than intangible aspects (e.g., food vs. rituals) and that they excel in recognizing broad cultural origins but struggle with a more nuanced understanding.
