WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines
October 16, 2024
作者: Genta Indra Winata, Frederikus Hudi, Patrick Amadeus Irawan, David Anugraha, Rifki Afina Putri, Yutong Wang, Adam Nohejl, Ubaidillah Ariq Prathama, Nedjma Ousidhoum, Afifa Amriani, Anar Rzayev, Anirban Das, Ashmari Pramodya, Aulia Adila, Bryan Wilie, Candy Olivia Mawalim, Ching Lam Cheng, Daud Abolade, Emmanuele Chersoni, Enrico Santus, Fariz Ikhwantri, Garry Kuwanto, Hanyang Zhao, Haryo Akbarianto Wibowo, Holy Lovenia, Jan Christian Blaise Cruz, Jan Wira Gotama Putra, Junho Myung, Lucky Susanto, Maria Angelica Riera Machin, Marina Zhukova, Michael Anugraha, Muhammad Farid Adilazuarda, Natasha Santosa, Peerat Limkonchotiwat, Raj Dabre, Rio Alexander Audino, Samuel Cahyawijaya, Shi-Xiong Zhang, Stephanie Yulia Salim, Yi Zhou, Yinxuan Gui, David Ifeoluwa Adelani, En-Shiun Annie Lee, Shogo Okada, Ayu Purwarianti, Alham Fikri Aji, Taro Watanabe, Derry Tanti Wijaya, Alice Oh, Chong-Wah Ngo
cs.AI
Abstract
Vision Language Models (VLMs) often struggle with culture-specific knowledge,
particularly in languages other than English and in underrepresented cultural
contexts. To evaluate their understanding of such knowledge, we introduce
WorldCuisines, a massive-scale benchmark for multilingual and multicultural,
visually grounded language understanding. This benchmark includes a visual
question answering (VQA) dataset with text-image pairs across 30 languages and
dialects, spanning 9 language families and featuring over 1 million data
points, making it the largest multicultural VQA benchmark to date. It includes
tasks for identifying dish names and their origins. We provide evaluation
datasets in two sizes (12k and 60k instances) alongside a training dataset (1
million instances). Our findings show that while VLMs perform better with
correct location context, they struggle with adversarial contexts and
predicting specific regional cuisines and languages. To support future
research, we release a knowledge base with annotated food entries and images
along with the VQA data.
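
As an illustration of how the released VQA data might be consumed, below is a minimal Python sketch that scores a VLM on a handful of instances. The dataset ID ("worldcuisines/vqa"), split name, and field names ("image", "question", "answer") are assumptions for illustration rather than the paper's confirmed schema, and answer_dish_question is a hypothetical stand-in for whichever VLM is queried.

```python
# Minimal sketch for probing a VLM on WorldCuisines-style VQA instances.
# Assumptions (not confirmed by the paper): the Hugging Face dataset ID
# "worldcuisines/vqa", the split name, and the field names "image",
# "question", and "answer" are illustrative placeholders.
from datasets import load_dataset


def answer_dish_question(image, question: str) -> str:
    """Hypothetical stand-in for any VLM call (API or local model)."""
    raise NotImplementedError("plug in your VLM here")


def evaluate(split: str = "test", limit: int = 5) -> float:
    ds = load_dataset("worldcuisines/vqa", split=split)  # assumed dataset ID
    rows = ds.select(range(min(limit, len(ds))))
    correct = 0
    for row in rows:
        pred = answer_dish_question(row["image"], row["question"])
        # Exact-match scoring after light normalization; the paper's
        # actual metric may differ.
        correct += int(pred.strip().lower() == row["answer"].strip().lower())
    return correct / len(rows)
```

Swapping in the 12k or 60k evaluation split (under whatever names the release uses) and a real model call turns this into a quick sanity check before a full benchmark run.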