Nile-Chat: Egyptian Language Models for Arabic and Latin Scripts
July 6, 2025
Authors: Guokan Shang, Hadi Abdine, Ahmad Chamma, Amr Mohamed, Mohamed Anwar, Abdelaziz Bounhar, Omar El Herraoui, Preslav Nakov, Michalis Vazirgiannis, Eric Xing
cs.AI
Abstract
We introduce Nile-Chat-4B, 3x4B-A6B, and 12B, a collection of LLMs for the
Egyptian dialect, uniquely designed to understand and generate text written in
both Arabic and Latin scripts. Specifically, with Nile-Chat-3x4B-A6B, we
introduce a novel language-adaptation approach that leverages the
Branch-Train-MiX strategy to merge script-specialized experts into a single
Mixture-of-Experts (MoE) model. Our Nile-Chat models significantly outperform leading multilingual
and Arabic LLMs, such as LLaMa, Jais, and ALLaM, on our newly introduced
Egyptian evaluation benchmarks, which span both understanding and generative
tasks. Notably, our 12B model yields a 14.4% performance gain over
Qwen2.5-14B-Instruct on Latin-script benchmarks. All our resources are publicly
available. We believe this work presents a comprehensive methodology for
adapting LLMs to dual-script languages, addressing an often overlooked aspect
in modern LLM development.
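
For intuition, here is a minimal sketch of what a Branch-Train-MiX-style merge can look like. This is not the paper's released code: it assumes a Llama-style decoder exposing a `layers` list whose blocks keep their feed-forward network under `.mlp`, and the names `MoEFFN` and `btx_merge` are hypothetical. Per layer, the FFNs of the Arabic-script and Latin-script experts become the experts of an MoE block, while all remaining weights are averaged; the router is then trained on mixed-script data.

```python
import copy

import torch
import torch.nn as nn


class MoEFFN(nn.Module):
    """Routes each token over per-script FFN experts (soft gating here;
    in practice the router is trained afterwards on mixed-script data)."""

    def __init__(self, experts, hidden_size):
        super().__init__()
        self.experts = nn.ModuleList(experts)
        self.router = nn.Linear(hidden_size, len(experts), bias=False)

    def forward(self, x):
        # x: (batch, seq, hidden); gates: (batch, seq, n_experts)
        gates = torch.softmax(self.router(x), dim=-1)
        # expert_outs: (batch, seq, hidden, n_experts)
        expert_outs = torch.stack([e(x) for e in self.experts], dim=-1)
        return torch.einsum("...hn,...n->...h", expert_outs, gates)


def btx_merge(expert_models, hidden_size):
    """Merge dense script-specialized models into one MoE model, BTX-style:
    per-layer FFNs become MoE experts, all other weights are averaged."""
    merged = copy.deepcopy(expert_models[0])
    for i, layer in enumerate(merged.layers):
        # Average the non-FFN parameters (attention, norms) across experts.
        for name, param in layer.named_parameters():
            if "mlp" in name:
                continue
            stacked = torch.stack(
                [m.layers[i].state_dict()[name] for m in expert_models]
            )
            param.data.copy_(stacked.mean(dim=0))
        # Each expert's FFN for this layer becomes one MoE expert.
        layer.mlp = MoEFFN(
            [copy.deepcopy(m.layers[i].mlp) for m in expert_models], hidden_size
        )
    return merged  # router weights are then fine-tuned on mixed-script data
```

Under these assumptions, the merged model starts from the averaged shared weights and the intact script experts, so subsequent fine-tuning mainly has to teach the router which expert to dispatch each token to.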