Nile-Chat: Egyptian Language Models for Arabic and Latin Scripts
July 6, 2025
Authors: Guokan Shang, Hadi Abdine, Ahmad Chamma, Amr Mohamed, Mohamed Anwar, Abdelaziz Bounhar, Omar El Herraoui, Preslav Nakov, Michalis Vazirgiannis, Eric Xing
cs.AI
Abstract
We introduce Nile-Chat-4B, 3x4B-A6B, and 12B, a collection of LLMs for the
Egyptian dialect, uniquely designed to understand and generate texts written in
both Arabic and Latin scripts. Specifically, with Nile-Chat-3x4B-A6B, we
introduce a novel language adaptation approach by leveraging the
Branch-Train-MiX strategy to merge script-specialized experts into a single
MoE model. Our Nile-Chat models significantly outperform leading multilingual
and Arabic LLMs, such as LLaMa, Jais, and ALLaM, on our newly introduced
Egyptian evaluation benchmarks, which span both understanding and generative
tasks. Notably, our 12B model yields a 14.4% performance gain over
Qwen2.5-14B-Instruct on Latin-script benchmarks. All our resources are publicly
available. We believe this work presents a comprehensive methodology for
adapting LLMs to dual-script languages, addressing an often overlooked aspect
in modern LLM development.
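To make the Branch-Train-MiX merging step concrete, below is a minimal sketch (our own illustration, not the authors' released code) of how the feed-forward layers of script-specialized dense experts can be collected into a single sparse MoE layer gated by a token-level router. All class names, dimensions, and the top-k routing choice are assumptions for exposition.

```python
# Sketch of a Branch-Train-MiX-style merge: dense experts branched from a
# shared base model (e.g. one trained on Arabic-script text, one on
# Latin-script text) contribute their FFN blocks to one sparse MoE layer.
# Attention and embedding weights would typically be averaged or shared.
# All names and shapes here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeedForward(nn.Module):
    """A standard transformer FFN block, as found in each dense expert."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.up(x)))


class BTXMoELayer(nn.Module):
    """Mixes the FFNs of script-specialized experts behind a learned
    token-level router, which is trained after the merge."""

    def __init__(self, expert_ffns: list[FeedForward], d_model: int, top_k: int = 1):
        super().__init__()
        # e.g. [arabic_script_ffn, latin_script_ffn], taken layer-by-layer
        # from the dense expert checkpoints.
        self.experts = nn.ModuleList(expert_ffns)
        self.router = nn.Linear(d_model, len(expert_ffns))
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Route each token to its top-k experts and mix their outputs.
        logits = self.router(x)                       # (batch, seq, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)             # normalize over chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e).unsqueeze(-1)   # tokens routed to expert e
                out = out + mask * weights[..., k : k + 1] * expert(x)
        return out
```

In a full merge along these lines, one such MoE layer would replace the FFN at every transformer layer, after which the routers (and optionally the whole model) are fine-tuned so tokens from each script learn to reach the matching expert.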