Nile-Chat: Egyptian Language Models for Arabic and Latin Scripts
July 6, 2025
Authors: Guokan Shang, Hadi Abdine, Ahmad Chamma, Amr Mohamed, Mohamed Anwar, Abdelaziz Bounhar, Omar El Herraoui, Preslav Nakov, Michalis Vazirgiannis, Eric Xing
cs.AI
Abstract
We introduce Nile-Chat-4B, 3x4B-A6B, and 12B, a collection of LLMs for the
Egyptian dialect, uniquely designed to understand and generate text written in
both Arabic and Latin scripts. Specifically, with Nile-Chat-3x4B-A6B, we
introduce a novel language-adaptation approach that leverages the
Branch-Train-MiX strategy to merge script-specialized experts into a single
Mixture-of-Experts (MoE) model. Our Nile-Chat models significantly outperform leading multilingual
and Arabic LLMs, such as LLaMa, Jais, and ALLaM, on our newly introduced
Egyptian evaluation benchmarks, which span both understanding and generative
tasks. Notably, our 12B model yields a 14.4% performance gain over
Qwen2.5-14B-Instruct on Latin-script benchmarks. All our resources are publicly
available. We believe this work presents a comprehensive methodology for
adapting LLMs to dual-script languages, addressing an often overlooked aspect
in modern LLM development.
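
For intuition, here is a minimal sketch of what a Branch-Train-MiX-style merge can look like. This is not the paper's released code: it assumes a Llama-style decoder exposing a `layers` list whose blocks keep their feed-forward network under `.mlp`, and the names `MoEFFN` and `btx_merge` are hypothetical. Per layer, the FFNs of the Arabic-script and Latin-script experts become the experts of an MoE block, while all remaining weights are averaged; the router is then trained on mixed-script data.

```python
import copy

import torch
import torch.nn as nn


class MoEFFN(nn.Module):
    """Routes each token over per-script FFN experts (soft gating here;
    in practice the router is trained afterwards on mixed-script data)."""

    def __init__(self, experts, hidden_size):
        super().__init__()
        self.experts = nn.ModuleList(experts)
        self.router = nn.Linear(hidden_size, len(experts), bias=False)

    def forward(self, x):
        # x: (batch, seq, hidden); gates: (batch, seq, n_experts)
        gates = torch.softmax(self.router(x), dim=-1)
        # expert_outs: (batch, seq, hidden, n_experts)
        expert_outs = torch.stack([e(x) for e in self.experts], dim=-1)
        return torch.einsum("...hn,...n->...h", expert_outs, gates)


def btx_merge(expert_models, hidden_size):
    """Merge dense script-specialized models into one MoE model, BTX-style:
    per-layer FFNs become MoE experts, all other weights are averaged."""
    merged = copy.deepcopy(expert_models[0])
    for i, layer in enumerate(merged.layers):
        # Average the non-FFN parameters (attention, norms) across experts.
        for name, param in layer.named_parameters():
            if "mlp" in name:
                continue
            stacked = torch.stack(
                [m.layers[i].state_dict()[name] for m in expert_models]
            )
            param.data.copy_(stacked.mean(dim=0))
        # Each expert's FFN for this layer becomes one MoE expert.
        layer.mlp = MoEFFN(
            [copy.deepcopy(m.layers[i].mlp) for m in expert_models], hidden_size
        )
    return merged  # router weights are then fine-tuned on mixed-script data
```

Under these assumptions, the merged model starts from the averaged shared weights and the intact script experts, so subsequent fine-tuning mainly has to teach the router which expert to dispatch each token to.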