

PALO: A Polyglot Large Multimodal Model for 5B People

February 22, 2024
Authors: Muhammad Maaz, Hanoona Rasheed, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M. Anwer, Tim Baldwin, Michael Felsberg, Fahad S. Khan
cs.AI

Abstract

In pursuit of more inclusive Vision-Language Models (VLMs), this study introduces a Large Multilingual Multimodal Model called Palo. Palo offers visual reasoning capabilities in 10 major languages, including English, Chinese, Hindi, Spanish, French, Arabic, Bengali, Russian, Urdu, and Japanese, that span a total of ~5B people (65% of the world population). Our approach involves a semi-automated translation pipeline that adapts the multimodal instruction dataset from English to the target languages using a fine-tuned Large Language Model, thereby ensuring high linguistic fidelity while remaining scalable due to minimal manual effort. Incorporating these diverse instruction sets boosts overall performance across multiple languages, especially underrepresented ones such as Hindi, Arabic, Bengali, and Urdu. The resulting models are trained across three scales (1.7B, 7B, and 13B parameters) to demonstrate generalization and scalability, and we observe substantial improvements compared to strong baselines. We also propose the first multilingual multimodal benchmark for forthcoming approaches to evaluate their vision-language reasoning capabilities across languages. Code: https://github.com/mbzuai-oryx/PALO.
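To make the semi-automated translation step concrete, below is a minimal sketch of how an English multimodal instruction dataset could be adapted to a target language with a fine-tuned LLM. It is not the released PALO code: the model path, prompt format, and LLaVA-style `conversations`/`value` field names are all illustrative assumptions; images are left untouched and only the text turns are translated, with the manual effort reduced to spot-checking the outputs.

```python
# Sketch of the semi-automated translation pipeline described in the abstract.
# Assumptions: a causal LM fine-tuned for translation (MODEL_ID is a
# placeholder) and a LLaVA-style instruction dataset schema.
import json
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "path/to/finetuned-translation-llm"  # placeholder, not the paper's model

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

def translate(text: str, language: str) -> str:
    """Translate one instruction/response string with the fine-tuned LLM."""
    prompt = f"Translate the following text to {language}:\n{text}\nTranslation:"
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=512)
    # Decode only the newly generated tokens, skipping the prompt.
    completion = tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    return completion.strip()

def adapt_dataset(path: str, language: str) -> list[dict]:
    """Translate every conversation turn in the dataset; image refs stay as-is."""
    with open(path) as f:
        samples = json.load(f)
    for sample in samples:
        for turn in sample["conversations"]:  # assumed LLaVA-style schema
            turn["value"] = translate(turn["value"], language)
    return samples
```

Running `adapt_dataset("llava_instruct_en.json", "Urdu")` for each target language would yield one translated copy of the instruction set per language, which can then be pooled for multilingual training; a small manually reviewed sample per language is what keeps linguistic fidelity high at low annotation cost.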