AIN: The Arabic INclusive Large Multimodal Model

Abstract

Amid the swift progress of large language models (LLMs) and their evolution into large multimodal models (LMMs), significant strides have been made in high-resource languages such as English and Chinese. While Arabic LLMs have seen notable progress, Arabic LMMs remain largely unexplored, and existing efforts often focus narrowly on a few specific aspects of language and visual understanding. To bridge this gap, we introduce AIN, the Arabic Inclusive Multimodal Model, an English-Arabic bilingual LMM designed to excel in both languages across diverse domains. AIN leverages 3.6 million carefully constructed, high-quality Arabic-English multimodal data samples. It demonstrates state-of-the-art Arabic performance while also possessing strong English-language visual capabilities. On the recent CAMEL-Bench benchmark, which spans eight domains and 38 sub-domains, including multi-image understanding, complex visual perception, handwritten document understanding, video understanding, medical imaging, plant diseases, and remote sensing-based land use understanding, AIN demonstrates strong performance: the 7B model outperforms GPT-4o by an absolute gain of 3.4% averaged over the eight domains and 38 sub-domains. AIN's superior capabilities position it as a significant step toward empowering Arabic speakers with advanced multimodal generative AI tools across diverse applications.