Dallah: A Dialect-Aware Multimodal Large Language Model for Arabic

Abstract

Recent advances have significantly improved the ability of Multimodal Large Language Models (MLLMs) to generate and understand image-to-text content. Despite these successes, progress remains largely limited to English because high-quality multimodal resources are scarce in other languages. This scarcity impedes the development of competitive models for languages such as Arabic. To alleviate this situation, we introduce Dallah, an efficient Arabic multimodal assistant that uses an advanced language model based on LLaMA-2 to facilitate multimodal interactions. Dallah demonstrates state-of-the-art performance among Arabic MLLMs. Through fine-tuning on six Arabic dialects, Dallah shows that it can handle complex dialectal interactions that combine textual and visual elements. The model excels on two benchmark tests: one evaluating its performance on Modern Standard Arabic (MSA) and another specifically designed to assess dialectal responses. Beyond its robust performance in multimodal interaction tasks, Dallah has the potential to pave the way for further development of dialect-aware Arabic MLLMs.
