Dallah: A Dialect-Aware Multimodal Large Language Model for Arabic
July 25, 2024
Authors: Fakhraddin Alwajih, Gagan Bhatia, Muhammad Abdul-Mageed
cs.AI
Abstract
Recent advancements have significantly enhanced the capabilities of Multimodal Large Language Models (MLLMs) in generating and understanding image-to-text content. Despite these successes, progress is predominantly limited to English due to the scarcity of high-quality multimodal resources in other languages. This limitation impedes the development of competitive models in languages such as Arabic. To alleviate this situation, we introduce an efficient Arabic multimodal assistant, dubbed Dallah, that utilizes an advanced language model based on LLaMA-2 to facilitate multimodal interactions. Dallah demonstrates state-of-the-art performance among Arabic MLLMs. Through fine-tuning on six Arabic dialects, Dallah showcases its capability to handle complex dialectal interactions that incorporate both textual and visual elements. The model excels in two benchmark tests: one evaluating its performance on Modern Standard Arabic (MSA) and another specifically designed to assess dialectal responses. Beyond its robust performance in multimodal interaction tasks, Dallah has the potential to pave the way for further development of dialect-aware Arabic MLLMs.