Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect
September 26, 2024
Authors: Guokan Shang, Hadi Abdine, Yousef Khoubrane, Amr Mohamed, Yassine Abbahaddou, Sofiane Ennadir, Imane Momayiz, Xuguang Ren, Eric Moulines, Preslav Nakov, Michalis Vazirgiannis, Eric Xing
cs.AI
Abstract
We introduce Atlas-Chat, the first-ever collection of large language models
specifically developed for dialectal Arabic. Focusing on Moroccan Arabic, also
known as Darija, we construct our instruction dataset by consolidating existing
Darija language resources, creating novel datasets both manually and
synthetically, and translating English instructions with stringent quality
control. Atlas-Chat-9B and 2B models, fine-tuned on the dataset, exhibit
superior ability in following Darija instructions and performing standard NLP
tasks. Notably, our models outperform both state-of-the-art and
Arabic-specialized LLMs like LLaMa, Jais, and AceGPT, e.g., achieving a 13%
performance boost over a larger 13B model on DarijaMMLU, in our newly
introduced evaluation suite for Darija covering both discriminative and
generative tasks. Furthermore, we perform an experimental analysis of various
fine-tuning strategies and base model choices to determine optimal
configurations. All our resources are publicly accessible, and we believe our
work offers comprehensive design methodologies of instruction-tuning for
low-resource language variants, which are often neglected in favor of data-rich
languages by contemporary LLMs.