Sadeed: Advancing Arabic Diacritization Through Small Language Model

Abstract

Arabic text diacritization remains a persistent challenge in natural language processing because of the language's morphological richness. In this paper, we introduce Sadeed, a novel approach based on a decoder-only language model fine-tuned from Kuwain 1.5B (Hennara et al., 2025), a compact model originally trained on diverse Arabic corpora. Sadeed is fine-tuned on carefully curated, high-quality diacritized datasets constructed through a rigorous data-cleaning and normalization pipeline. Despite using modest computational resources, Sadeed achieves results competitive with proprietary large language models and outperforms traditional models trained on similar domains. Additionally, we highlight key limitations in current benchmarking practices for Arabic diacritization. To address these issues, we introduce SadeedDiac-25, a new benchmark designed to enable fairer and more comprehensive evaluation across diverse text genres and complexity levels. Together, Sadeed and SadeedDiac-25 provide a robust foundation for advancing Arabic NLP applications, including machine translation, text-to-speech, and language learning tools.