Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia
March 10, 2025
Authors: Samuel Cahyawijaya, Holy Lovenia, Joel Ruben Antony Moniz, Tack Hwa Wong, Mohammad Rifqi Farhansyah, Thant Thiri Maung, Frederikus Hudi, David Anugraha, Muhammad Ravi Shulthan Habibi, Muhammad Reza Qorib, Amit Agarwal, Joseph Marvin Imperial, Hitesh Laxmichand Patel, Vicky Feliren, Bahrul Ilmi Nasution, Manuel Antonio Rufino, Genta Indra Winata, Rian Adam Rajagede, Carlos Rafael Catalan, Mohamed Fazli Imam, Priyaranjan Pattnayak, Salsabila Zahirah Pranida, Kevin Pratama, Yeshil Bangera, Adisai Na-Thalang, Patricia Nicole Monderin, Yueqi Song, Christian Simon, Lynnette Hui Xian Ng, Richardy Lobo' Sapan, Taki Hasan Rafi, Bin Wang, Supryadi, Kanyakorn Veerakanjana, Piyalitt Ittichaiwong, Matthew Theodore Roque, Karissa Vincentio, Takdanai Kreangphet, Phakphum Artkaew, Kadek Hendrawan Palgunadi, Yanzhi Yu, Rochana Prih Hastuti, William Nixon, Mithil Bangera, Adrian Xuan Wei Lim, Aye Hninn Khine, Hanif Muhammad Zhafran, Teddy Ferdinan, Audra Aurora Izzani, Ayushman Singh, Evan, Jauza Akbar Krito, Michael Anugraha, Fenal Ashokbhai Ilasariya, Haochen Li, John Amadeo Daniswara, Filbert Aurelian Tjiaranata, Eryawan Presma Yulianrifat, Can Udomcharoenchaikit, Fadil Risdian Ansori, Mahardika Krisna Ihsani, Giang Nguyen, Anab Maulana Barik, Dan John Velasco, Rifo Ahmad Genadi, Saptarshi Saha, Chengwei Wei, Isaiah Flores, Kenneth Ko Han Chen, Anjela Gail Santos, Wan Shen Lim, Kaung Si Phyo, Tim Santos, Meisyarah Dwiastuti, Jiayun Luo, Jan Christian Blaise Cruz, Ming Shan Hee, Ikhlasul Akmal Hanif, M. Alif Al Hakim, Muhammad Rizky Sya'ban, Kun Kerdthaisong, Lester James V. Miranda, Fajri Koto, Tirana Noor Fatyanosa, Alham Fikri Aji, Jostin Jerico Rosal, Jun Kevin, Robert Wijaya, Onno P. Kampman, Ruochen Zhang, Börje F. Karlsson, Peerat Limkonchotiwat
cs.AI
Abstract
Southeast Asia (SEA) is a region of extraordinary linguistic and cultural
diversity, yet it remains significantly underrepresented in vision-language
(VL) research. This often results in artificial intelligence (AI) models that
fail to capture SEA cultural nuances. To fill this gap, we present SEA-VL, an
open-source initiative dedicated to developing high-quality, culturally
relevant data for SEA languages. By involving contributors from SEA countries,
SEA-VL aims to ensure better cultural relevance and diversity, fostering
greater inclusivity of underrepresented languages in VL research. Beyond
crowdsourcing, our initiative also explores the automatic collection of
culturally relevant images through crawling and image
generation. First, we find that image crawling achieves approximately 85%
cultural relevance while being more cost- and time-efficient than
crowdsourcing. Second, despite the substantial progress in generative vision
models, synthetic images remain unreliable in accurately reflecting SEA
cultures. The generated images often fail to reflect the nuanced traditions and
cultural contexts of the region. In total, we gather 1.28M culturally
relevant SEA images, more than 50 times larger than other existing
datasets. Through SEA-VL, we aim to bridge the representation gap in SEA,
fostering the development of more inclusive AI systems that authentically
represent diverse cultures across SEA.