Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages
October 21, 2024
Authors: Xiang Yue, Yueqi Song, Akari Asai, Seungone Kim, Jean de Dieu Nyandwi, Simran Khanuja, Anjali Kantharuban, Lintang Sutawika, Sathyanarayanan Ramamoorthy, Graham Neubig
cs.AI
Abstract
Despite recent advances in multimodal large language models (MLLMs), their
development has predominantly focused on English- and western-centric datasets
and tasks, leaving most of the world's languages and diverse cultural contexts
underrepresented. This paper introduces Pangea, a multilingual multimodal LLM
trained on PangeaIns, a diverse 6M instruction dataset spanning 39 languages.
PangeaIns features: 1) high-quality English instructions, 2) carefully
machine-translated instructions, and 3) culturally relevant multimodal tasks to
ensure cross-cultural coverage. To rigorously assess models' capabilities, we
introduce PangeaBench, a holistic evaluation suite encompassing 14 datasets
covering 47 languages. Results show that Pangea significantly outperforms
existing open-source models in multilingual settings and diverse cultural
contexts. Ablation studies further reveal the impact of English data
proportions, language popularity, and the number of multimodal training samples
on overall performance. We fully open-source our data, code, and trained
checkpoints to facilitate the development of inclusive and robust multilingual
MLLMs, promoting equity and accessibility across a broader linguistic and
cultural spectrum.
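Because the checkpoints are released openly, a model like Pangea can in principle be queried with standard open-source tooling. Below is a minimal sketch using Hugging Face transformers; the model ID neulab/Pangea-7B-hf, the LLaVA-style <image> placeholder, and the example prompt are illustrative assumptions, not details specified in the abstract.

```python
# Minimal sketch: multilingual image question answering with an open
# multimodal checkpoint via Hugging Face transformers. The model ID and
# prompt format below are assumptions for illustration only.
import requests
from PIL import Image
from transformers import AutoProcessor, LlavaNextForConditionalGeneration

MODEL_ID = "neulab/Pangea-7B-hf"  # hypothetical transformers-compatible ID

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaNextForConditionalGeneration.from_pretrained(
    MODEL_ID, device_map="auto"
)

# Pair an image with a non-English query; the <image> placeholder follows
# a common LLaVA-style convention and may differ for the released model.
image = Image.open(
    requests.get("https://example.com/photo.jpg", stream=True).raw
)
prompt = "<image>\nこの写真に写っている料理を説明してください。"  # Japanese query

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```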