Nile-Chat: Egyptian Language Models for Arabic and Latin Scripts
July 6, 2025
Authors: Guokan Shang, Hadi Abdine, Ahmad Chamma, Amr Mohamed, Mohamed Anwar, Abdelaziz Bounhar, Omar El Herraoui, Preslav Nakov, Michalis Vazirgiannis, Eric Xing
cs.AI
Abstract
We introduce Nile-Chat-4B, 3x4B-A6B, and 12B, a collection of LLMs for the
Egyptian dialect, uniquely designed to understand and generate texts written in
both Arabic and Latin scripts. Specifically, with Nile-Chat-3x4B-A6B, we
introduce a novel language adaptation approach by leveraging the
Branch-Train-MiX strategy to merge script-specialized experts into a single
MoE model. Our Nile-Chat models significantly outperform leading multilingual
and Arabic LLMs, such as LLaMa, Jais, and ALLaM, on our newly introduced
Egyptian evaluation benchmarks, which span both understanding and generative
tasks. Notably, our 12B model yields a 14.4% performance gain over
Qwen2.5-14B-Instruct on Latin-script benchmarks. All our resources are publicly
available. We believe this work presents a comprehensive methodology for
adapting LLMs to dual-script languages, addressing an often overlooked aspect
in modern LLM development.
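To make the Branch-Train-MiX merging step concrete, below is a minimal sketch (our own illustration, not the authors' released code) of how the feed-forward layers of script-specialized dense experts can be collected into a single sparse MoE layer gated by a token-level router. All class names, dimensions, and the top-k routing choice are assumptions for exposition.

```python
# Sketch of a Branch-Train-MiX-style merge: dense experts branched from a
# shared base model (e.g. one trained on Arabic-script text, one on
# Latin-script text) contribute their FFN blocks to one sparse MoE layer.
# Attention and embedding weights would typically be averaged or shared.
# All names and shapes here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeedForward(nn.Module):
    """A standard transformer FFN block, as found in each dense expert."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.up(x)))


class BTXMoELayer(nn.Module):
    """Mixes the FFNs of script-specialized experts behind a learned
    token-level router, which is trained after the merge."""

    def __init__(self, expert_ffns: list[FeedForward], d_model: int, top_k: int = 1):
        super().__init__()
        # e.g. [arabic_script_ffn, latin_script_ffn], taken layer-by-layer
        # from the dense expert checkpoints.
        self.experts = nn.ModuleList(expert_ffns)
        self.router = nn.Linear(d_model, len(expert_ffns))
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Route each token to its top-k experts and mix their outputs.
        logits = self.router(x)                       # (batch, seq, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)             # normalize over chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e).unsqueeze(-1)   # tokens routed to expert e
                out = out + mask * weights[..., k : k + 1] * expert(x)
        return out
```

In a full merge along these lines, one such MoE layer would replace the FFN at every transformer layer, after which the routers (and optionally the whole model) are fine-tuned so tokens from each script learn to reach the matching expert.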