PALO: A Polyglot Large Multimodal Model for 5B People

Abstract

In pursuit of more inclusive Vision-Language Models (VLMs), this study introduces a Large Multilingual Multimodal Model called Palo. Palo offers visual reasoning capabilities in 10 major languages (English, Chinese, Hindi, Spanish, French, Arabic, Bengali, Russian, Urdu, and Japanese) that together span approximately 5B people (65% of the world population). We use a semi-automated translation approach that adapts the multimodal instruction dataset from English to the target languages using a fine-tuned Large Language Model, ensuring high linguistic fidelity while remaining scalable because little manual effort is required. Incorporating diverse instruction sets boosts overall performance across multiple languages, especially underrepresented ones such as Hindi, Arabic, Bengali, and Urdu. The resulting models are trained at three scales (1.7B, 7B, and 13B parameters) to demonstrate generalization and scalability, and we observe substantial improvements over strong baselines. We also propose the first multilingual multimodal benchmark, on which forthcoming approaches can evaluate their vision-language reasoning capabilities across languages. Code: https://github.com/mbzuai-oryx/PALO.
