PALO: A Polyglot Large Multimodal Model for 5B People
February 22, 2024
Authors: Muhammad Maaz, Hanoona Rasheed, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M. Anwer, Tim Baldwin, Michael Felsberg, Fahad S. Khan
cs.AI
Abstract
In pursuit of more inclusive Vision-Language Models (VLMs), this study introduces a Large Multilingual Multimodal Model called Palo. Palo offers visual reasoning capabilities in 10 major languages, including English, Chinese, Hindi, Spanish, French, Arabic, Bengali, Russian, Urdu, and Japanese, that together span ~5B people (65% of the world population). Our approach uses a semi-automated translation pipeline to adapt the multimodal instruction dataset from English to the target languages with a fine-tuned Large Language Model, ensuring high linguistic fidelity while remaining scalable through minimal manual effort. Incorporating these diverse instruction sets boosts overall performance across multiple languages, especially underrepresented ones such as Hindi, Arabic, Bengali, and Urdu. The resulting models are trained at three scales (1.7B, 7B, and 13B parameters) to demonstrate generalization and scalability, and we observe substantial improvements over strong baselines. We also propose the first multilingual multimodal benchmark for evaluating the vision-language reasoning capabilities of future approaches across languages. Code: https://github.com/mbzuai-oryx/PALO.