CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark
June 10, 2024
Authors: David Romero, Chenyang Lyu, Haryo Akbarianto Wibowo, Teresa Lynn, Injy Hamed, Aditya Nanda Kishore, Aishik Mandal, Alina Dragonetti, Artem Abzaliev, Atnafu Lambebo Tonja, Bontu Fufa Balcha, Chenxi Whitehouse, Christian Salamea, Dan John Velasco, David Ifeoluwa Adelani, David Le Meur, Emilio Villa-Cueva, Fajri Koto, Fauzan Farooqui, Frederico Belcavello, Ganzorig Batnasan, Gisela Vallejo, Grainne Caulfield, Guido Ivetta, Haiyue Song, Henok Biadglign Ademtew, Hernán Maina, Holy Lovenia, Israel Abebe Azime, Jan Christian Blaise Cruz, Jay Gala, Jiahui Geng, Jesus-German Ortiz-Barajas, Jinheon Baek, Jocelyn Dunstan, Laura Alonso Alemany, Kumaranage Ravindu Yasas Nagasinghe, Luciana Benotti, Luis Fernando D'Haro, Marcelo Viridiano, Marcos Estecha-Garitagoitia, Maria Camila Buitrago Cabrera, Mario Rodríguez-Cantelar, Mélanie Jouitteau, Mihail Mihaylov, Mohamed Fazli Mohamed Imam, Muhammad Farid Adilazuarda, Munkhjargal Gochoo, Munkh-Erdene Otgonbold, Naome Etori, Olivier Niyomugisha, Paula Mónica Silva, Pranjal Chitale, Raj Dabre, Rendi Chevi, Ruochen Zhang, Ryandito Diandaru, Samuel Cahyawijaya, Santiago Góngora, Soyeong Jeong, Sukannya Purkayastha, Tatsuki Kuribayashi, Thanmay Jayakumar, Tiago Timponi Torrent, Toqeer Ehsan, Vladimir Araujo, Yova Kementchedjhieva, Zara Burzo, Zheng Wei Lim, Zheng Xin Yong, Oana Ignat, Joan Nwatu, Rada Mihalcea, Thamar Solorio, Alham Fikri Aji
cs.AI
Abstract
Visual Question Answering (VQA) is an important task in multimodal AI, and it
is often used to test the ability of vision-language models to understand and
reason over knowledge present in both visual and textual data. However, most of
the current VQA models use datasets that are primarily focused on English and a
few major world languages, with images that are typically Western-centric.
While recent efforts have tried to increase the number of languages covered in
VQA datasets, they still lack diversity in low-resource languages. More
importantly, although these datasets often extend their linguistic range via
translation or other approaches, they usually keep the images unchanged,
resulting in narrow cultural representation. To address these limitations, we
construct CVQA, a new Culturally-diverse multilingual Visual Question Answering
benchmark, designed to cover a rich set of languages and cultures, where we
engage native speakers and cultural experts in the data collection process. As
a result, CVQA includes culturally-driven images and questions from across 28
countries on four continents, covering 26 languages with 11 scripts, providing
a total of 9k questions. We then benchmark several Multimodal Large Language
Models (MLLMs) on CVQA, and show that the dataset is challenging for the
current state-of-the-art models. This benchmark can serve as a probing
evaluation suite for assessing the cultural capability and bias of multimodal
models, and we hope it will encourage more research efforts toward increasing
cultural awareness and linguistic diversity in this field.
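
The abstract describes benchmarking several MLLMs on CVQA but does not spell out the evaluation protocol. Below is a minimal, illustrative sketch of what a multiple-choice VQA evaluation loop could look like; the field names, the multiple-choice format, and the stubbed model are assumptions made for illustration and are not the paper's actual data schema or evaluation code.

```python
# Illustrative sketch of a CVQA-style evaluation loop (not the official harness).
# Field names ("image", "question", "options", "answer_idx", "language") and the
# multiple-choice format are assumptions; the model is stubbed with a random guesser.
import random
from collections import defaultdict

def random_mllm(image_path, question, options):
    """Stand-in for a real multimodal LLM; returns the index of the chosen option."""
    return random.randrange(len(options))

# Tiny hand-made items standing in for CVQA examples (not real data).
dataset = [
    {"image": "img_001.jpg", "question": "What festival is shown?",
     "options": ["option A", "option B", "option C", "option D"],
     "answer_idx": 2, "language": "Indonesian"},
    {"image": "img_002.jpg", "question": "What dish is on the plate?",
     "options": ["option A", "option B", "option C", "option D"],
     "answer_idx": 0, "language": "Spanish"},
]

def evaluate(model_fn, data):
    """Compute overall and per-language accuracy on a multiple-choice VQA set."""
    correct, total = 0, 0
    per_lang = defaultdict(lambda: [0, 0])  # language -> [correct, total]
    for item in data:
        pred = model_fn(item["image"], item["question"], item["options"])
        hit = int(pred == item["answer_idx"])
        correct += hit
        total += 1
        per_lang[item["language"]][0] += hit
        per_lang[item["language"]][1] += 1
    overall = correct / total if total else 0.0
    by_lang = {lang: c / n for lang, (c, n) in per_lang.items()}
    return overall, by_lang

if __name__ == "__main__":
    acc, by_lang = evaluate(random_mllm, dataset)
    print(f"overall accuracy: {acc:.2f}")
    for lang, a in sorted(by_lang.items()):
        print(f"  {lang}: {a:.2f}")
```

Reporting accuracy per language (or per country) in this way is one natural fit for a benchmark whose stated goal is to probe cultural capability and bias rather than only aggregate performance.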