Sadeed: Advancing Arabic Diacritization Through Small Language Model

Abstract

Arabic text diacritization remains a persistent challenge in natural language processing because of the language's morphological richness. In this paper, we introduce Sadeed, a novel approach based on a decoder-only language model fine-tuned from Kuwain 1.5B (Hennara et al., 2025), a compact model originally trained on diverse Arabic corpora. Sadeed is fine-tuned on carefully curated, high-quality diacritized datasets constructed through a rigorous data-cleaning and normalization pipeline. Despite using modest computational resources, Sadeed achieves results competitive with proprietary large language models and outperforms traditional models trained on similar domains. Additionally, we highlight key limitations in current benchmarking practices for Arabic diacritization. To address these issues, we introduce SadeedDiac-25, a new benchmark designed to enable fairer and more comprehensive evaluation across diverse text genres and complexity levels. Together, Sadeed and SadeedDiac-25 provide a robust foundation for advancing Arabic NLP applications, including machine translation, text-to-speech, and language learning tools.
