PALO: A Polyglot Large Multimodal Model for 5B People
February 22, 2024
Authors: Muhammad Maaz, Hanoona Rasheed, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M. Anwer, Tim Baldwin, Michael Felsberg, Fahad S. Khan
cs.AI
Abstract
In pursuit of more inclusive Vision-Language Models (VLMs), this study introduces a Large Multilingual Multimodal Model called Palo. Palo offers visual reasoning capabilities in 10 major languages, including English, Chinese, Hindi, Spanish, French, Arabic, Bengali, Russian, Urdu, and Japanese, that together span ~5B people (65% of the world population). Our approach uses a semi-automated translation pipeline to adapt the multimodal instruction dataset from English to the target languages with a fine-tuned Large Language Model, ensuring high linguistic fidelity while remaining scalable through minimal manual effort. Incorporating these diverse instruction sets boosts overall performance across multiple languages, especially underrepresented ones such as Hindi, Arabic, Bengali, and Urdu. The resulting models are trained at three scales (1.7B, 7B, and 13B parameters) to demonstrate generalization and scalability, and we observe substantial improvements over strong baselines. We also propose the first multilingual multimodal benchmark for evaluating the vision-language reasoning capabilities of future approaches across languages. Code: https://github.com/mbzuai-oryx/PALO.