Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect
September 26, 2024
Authors: Guokan Shang, Hadi Abdine, Yousef Khoubrane, Amr Mohamed, Yassine Abbahaddou, Sofiane Ennadir, Imane Momayiz, Xuguang Ren, Eric Moulines, Preslav Nakov, Michalis Vazirgiannis, Eric Xing
cs.AI
Abstract
We introduce Atlas-Chat, the first-ever collection of large language models
specifically developed for dialectal Arabic. Focusing on Moroccan Arabic, also
known as Darija, we construct our instruction dataset by consolidating existing
Darija language resources, creating novel datasets both manually and
synthetically, and translating English instructions with stringent quality
control. Atlas-Chat-9B and 2B models, fine-tuned on the dataset, exhibit
superior ability in following Darija instructions and performing standard NLP
tasks. Notably, our models outperform both state-of-the-art and
Arabic-specialized LLMs like LLaMa, Jais, and AceGPT, e.g., achieving a 13%
performance boost over a larger 13B model on DarijaMMLU, in our newly
introduced evaluation suite for Darija covering both discriminative and
generative tasks. Furthermore, we perform an experimental analysis of various
fine-tuning strategies and base model choices to determine optimal
configurations. All our resources are publicly accessible, and we believe our
work offers comprehensive design methodologies of instruction-tuning for
low-resource language variants, which are often neglected in favor of data-rich
languages by contemporary LLMs.
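The dataset construction described in the abstract (consolidating existing Darija resources, creating records manually and synthetically, and translating English instructions under quality control) can be illustrated with a minimal sketch of what a single instruction record and a basic quality-control check might look like. All field names, example text, and the `validate` helper below are hypothetical and are not taken from the actual Atlas-Chat release.

```python
# Hypothetical sketch of one Darija instruction-tuning record in the
# common (instruction, input, output) format. Field names and content
# are illustrative only, not from the Atlas-Chat dataset.
record = {
    "instruction": "ترجم الجملة التالية من الدارجة إلى الإنجليزية",  # "Translate the following Darija sentence to English"
    "input": "كيفاش نقدر نوصل للمحطة؟",
    "output": "How can I get to the station?",
    "source": "manual",  # provenance tag: e.g. consolidated / manual / synthetic / translated
}

def validate(rec):
    """Basic quality-control check: required fields present and non-empty."""
    required = ("instruction", "output")
    return all(rec.get(k, "").strip() for k in required)

print(validate(record))  # True
```

A real pipeline would add further checks (deduplication, translation quality filtering), but the record-plus-validator shape above is a common pattern for assembling instruction datasets from heterogeneous sources.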