Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect

Abstract

We introduce Atlas-Chat, the first collection of large language models developed specifically for dialectal Arabic. Focusing on Moroccan Arabic, also known as Darija, we construct our instruction dataset by consolidating existing Darija language resources, creating novel datasets both manually and synthetically, and translating English instructions under stringent quality control. The Atlas-Chat-9B and 2B models, fine-tuned on this dataset, exhibit superior ability in following Darija instructions and performing standard NLP tasks. Notably, our models outperform both state-of-the-art and Arabic-specialized LLMs such as LLaMa, Jais, and AceGPT. For example, they achieve a 13% performance boost over a larger 13B model on DarijaMMLU, part of our newly introduced evaluation suite for Darija, which covers both discriminative and generative tasks. Furthermore, we perform an experimental analysis of various fine-tuning strategies and base model choices to determine optimal configurations. All our resources are publicly accessible, and we believe our work offers a comprehensive design methodology for instruction-tuning on low-resource language variants, which contemporary LLMs often neglect in favor of data-rich languages.
