Nile-Chat: Egyptian Language Models for Arabic and Latin Scripts

Abstract

We introduce Nile-Chat-4B, 3x4B-A6B, and 12B, a collection of LLMs for the Egyptian dialect, uniquely designed to understand and generate text written in both Arabic and Latin scripts. Specifically, with Nile-Chat-3x4B-A6B, we introduce a novel language adaptation approach that leverages the Branch-Train-MiX strategy to merge script-specialized experts into a single MoE (Mixture of Experts) model. Our Nile-Chat models significantly outperform leading multilingual and Arabic LLMs, such as LLaMa, Jais, and ALLaM, on our newly introduced Egyptian evaluation benchmarks, which span both understanding and generative tasks. Notably, our 12B model yields a 14.4% performance gain over Qwen2.5-14B-Instruct on Latin-script benchmarks. All our resources are publicly available. We believe this work presents a comprehensive methodology for adapting LLMs to dual-script languages, addressing an often overlooked aspect of modern LLM development.
