Nile-Chat: Egyptian Language Models for Arabic and Latin Scripts

Abstract

We introduce Nile-Chat-4B, 3x4B-A6B, and 12B, a collection of LLMs for the Egyptian dialect, uniquely designed to understand and generate text written in both Arabic and Latin scripts. Specifically, with Nile-Chat-3x4B-A6B, we introduce a novel language adaptation approach that leverages the Branch-Train-MiX strategy to merge script-specialized experts into a single MoE (Mixture of Experts) model. Our Nile-Chat models significantly outperform leading multilingual and Arabic LLMs, such as LLaMa, Jais, and ALLaM, on our newly introduced Egyptian evaluation benchmarks, which span both understanding and generative tasks. Notably, our 12B model yields a 14.4% performance gain over Qwen2.5-14B-Instruct on Latin-script benchmarks. All our resources are publicly available. We believe this work presents a comprehensive methodology for adapting LLMs to dual-script languages, addressing an often overlooked aspect of modern LLM development.
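The abstract does not spell out how the script-specialized experts are merged, so the following is only a minimal sketch of the Branch-Train-MiX idea as it is generally described: the feed-forward weights of separately trained dense models are kept as distinct MoE experts, all other weights are averaged, and a router picks the top-k experts per token. The checkpoint names (`arabic`, `latin`, `base`), the parameter keys, and the toy values are hypothetical and are not taken from the Nile-Chat release.

```python
import math

def btx_merge(experts):
    """Merge dense checkpoints: average non-FFN weights, keep each FFN as its own expert."""
    merged = {"shared": {}, "ffn_experts": []}
    for name in experts[0]:
        if name.startswith("ffn"):
            continue  # FFN weights are not averaged; they become MoE experts below
        columns = zip(*(e[name] for e in experts))
        merged["shared"][name] = [sum(c) / len(experts) for c in columns]
    for e in experts:
        merged["ffn_experts"].append(
            {n: v for n, v in e.items() if n.startswith("ffn")}
        )
    return merged

def route_top_k(logits, k):
    """Return (expert_index, weight) pairs using a softmax over the top-k router logits."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, w / total) for i, w in zip(top, exps)]

# Hypothetical toy checkpoints: one per script plus a shared base model.
arabic = {"attn.w": [1.0, 2.0], "ffn.w": [0.1, 0.2]}
latin = {"attn.w": [3.0, 4.0], "ffn.w": [0.3, 0.4]}
base = {"attn.w": [2.0, 3.0], "ffn.w": [0.5, 0.6]}

merged = btx_merge([arabic, latin, base])
picks = route_top_k([2.0, 0.0, 2.0], k=2)  # router logits for one token (made up)
```

In a real model the router is a learned layer that is trained after merging; here its logits are supplied by hand purely to show how the top-k weights are normalized.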

