CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark
June 10, 2024
Authors: David Romero, Chenyang Lyu, Haryo Akbarianto Wibowo, Teresa Lynn, Injy Hamed, Aditya Nanda Kishore, Aishik Mandal, Alina Dragonetti, Artem Abzaliev, Atnafu Lambebo Tonja, Bontu Fufa Balcha, Chenxi Whitehouse, Christian Salamea, Dan John Velasco, David Ifeoluwa Adelani, David Le Meur, Emilio Villa-Cueva, Fajri Koto, Fauzan Farooqui, Frederico Belcavello, Ganzorig Batnasan, Gisela Vallejo, Grainne Caulfield, Guido Ivetta, Haiyue Song, Henok Biadglign Ademtew, Hernán Maina, Holy Lovenia, Israel Abebe Azime, Jan Christian Blaise Cruz, Jay Gala, Jiahui Geng, Jesus-German Ortiz-Barajas, Jinheon Baek, Jocelyn Dunstan, Laura Alonso Alemany, Kumaranage Ravindu Yasas Nagasinghe, Luciana Benotti, Luis Fernando D'Haro, Marcelo Viridiano, Marcos Estecha-Garitagoitia, Maria Camila Buitrago Cabrera, Mario Rodríguez-Cantelar, Mélanie Jouitteau, Mihail Mihaylov, Mohamed Fazli Mohamed Imam, Muhammad Farid Adilazuarda, Munkhjargal Gochoo, Munkh-Erdene Otgonbold, Naome Etori, Olivier Niyomugisha, Paula Mónica Silva, Pranjal Chitale, Raj Dabre, Rendi Chevi, Ruochen Zhang, Ryandito Diandaru, Samuel Cahyawijaya, Santiago Góngora, Soyeong Jeong, Sukannya Purkayastha, Tatsuki Kuribayashi, Thanmay Jayakumar, Tiago Timponi Torrent, Toqeer Ehsan, Vladimir Araujo, Yova Kementchedjhieva, Zara Burzo, Zheng Wei Lim, Zheng Xin Yong, Oana Ignat, Joan Nwatu, Rada Mihalcea, Thamar Solorio, Alham Fikri Aji
cs.AI
Abstract
Visual Question Answering (VQA) is an important task in multimodal AI, often used to test the ability of vision-language models to understand and reason over knowledge present in both visual and textual data. However, most current VQA models use datasets that focus primarily on English and a few major world languages, with images that are typically Western-centric. While recent efforts have tried to increase the number of languages covered by VQA datasets, these datasets still lack diversity in low-resource languages. More importantly, although they often extend their linguistic range via translation or other approaches, they usually keep the images unchanged, resulting in narrow cultural representation. To address these limitations, we construct CVQA, a new Culturally-diverse multilingual Visual Question Answering benchmark designed to cover a rich set of languages and cultures, in which we engage native speakers and cultural experts during data collection. As a result, CVQA includes culturally-driven images and questions from 28 countries across four continents, covering 26 languages with 11 scripts and providing a total of 9k questions. We then benchmark several Multimodal Large Language Models (MLLMs) on CVQA and show that the dataset is challenging for current state-of-the-art models. This benchmark can serve as a probing evaluation suite for assessing the cultural capability and bias of multimodal models, and we hope it will encourage more research efforts toward increasing cultural awareness and linguistic diversity in this field.
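As a rough illustration of how such a benchmark run might look, the sketch below scores a model by multiple-choice accuracy on CVQA. It assumes the dataset is published on the Hugging Face Hub under the ID `afaji/cvqa` and that each example exposes `image`, `Question`, `Options`, and `Label` fields; both the dataset ID and the field names are assumptions not stated in the abstract, and `answer_question` is a hypothetical stand-in for whichever MLLM is being evaluated.

```python
# Minimal sketch of a multiple-choice VQA evaluation loop over CVQA.
# Assumed (not stated in the abstract): Hub dataset ID "afaji/cvqa" and the
# per-example fields "image", "Question", "Options", "Label".
from datasets import load_dataset


def answer_question(image, question: str, options: list[str]) -> int:
    """Hypothetical wrapper: query an MLLM and return the index of its chosen option."""
    raise NotImplementedError("Plug in the multimodal model under evaluation here.")


def evaluate(split: str = "test") -> float:
    """Return multiple-choice accuracy of answer_question() on one CVQA split."""
    ds = load_dataset("afaji/cvqa", split=split)  # dataset ID is an assumption
    correct = 0
    for example in ds:
        pred = answer_question(example["image"], example["Question"], example["Options"])
        correct += int(pred == example["Label"])
    return correct / len(ds)
```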