Dallah: A Dialect-Aware Multimodal Large Language Model for Arabic

Abstract

Recent advancements have significantly enhanced the capabilities of Multimodal Large Language Models (MLLMs) in generating and understanding image-to-text content. Despite these successes, progress is predominantly limited to English due to the scarcity of high-quality multimodal resources in other languages. This limitation impedes the development of competitive models in languages such as Arabic. To alleviate this situation, we introduce an efficient Arabic multimodal assistant, dubbed Dallah, that utilizes an advanced language model based on LLaMA-2 to facilitate multimodal interactions. Dallah demonstrates state-of-the-art performance among Arabic MLLMs. Fine-tuned on six Arabic dialects, Dallah can handle complex dialectal interactions that incorporate both textual and visual elements. The model excels in two benchmark tests: one evaluating its performance on Modern Standard Arabic (MSA) and another specifically designed to assess dialectal responses. Beyond its robust performance in multimodal interaction tasks, Dallah has the potential to pave the way for further development of dialect-aware Arabic MLLMs.
