Sadeed: Advancing Arabic Diacritization Through Small Language Models

Abstract

Arabic text diacritization remains a persistent challenge in natural language processing because of the language's morphological richness. In this paper, we introduce Sadeed, a novel approach based on a fine-tuned decoder-only language model adapted from Kuwain 1.5B (Hennara et al., 2025), a compact model originally trained on diverse Arabic corpora. Sadeed is fine-tuned on carefully curated, high-quality diacritized datasets constructed through a rigorous data-cleaning and normalization pipeline. Despite using modest computational resources, Sadeed achieves results competitive with proprietary large language models and outperforms traditional models trained on similar domains. Additionally, we highlight key limitations in current benchmarking practices for Arabic diacritization. To address these issues, we introduce SadeedDiac-25, a new benchmark designed to enable fairer and more comprehensive evaluation across diverse text genres and complexity levels. Together, Sadeed and SadeedDiac-25 provide a robust foundation for advancing Arabic NLP applications, including machine translation, text-to-speech, and language learning tools.
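To make the task concrete, the following is a minimal sketch of how an (undiacritized input, diacritized target) pair might be derived from diacritized text. It is an illustration only and is not taken from the Sadeed pipeline: the function names `strip_diacritics` and `make_pair`, the choice of NFC normalization, and the exact set of code points treated as diacritics are all assumptions made for this example.

```python
import unicodedata

# Assumed diacritic set for this sketch: the Arabic tashkeel marks
# U+064B (fathatan) through U+0652 (sukun), plus U+0670 (superscript alef).
ARABIC_DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)} | {"\u0670"}


def strip_diacritics(text: str) -> str:
    """Remove the diacritic marks listed above, keeping base letters as-is."""
    return "".join(ch for ch in text if ch not in ARABIC_DIACRITICS)


def make_pair(diacritized: str) -> tuple[str, str]:
    """Build a hypothetical (model input, target) pair from diacritized text.

    The target is the NFC-normalized diacritized string; the input is the
    same string with its diacritics removed.
    """
    target = unicodedata.normalize("NFC", diacritized)
    return strip_diacritics(target), target


if __name__ == "__main__":
    # The word "kataba" written with a fatha on each of its three letters.
    source, target = make_pair("\u0643\u064E\u062A\u064E\u0628\u064E")
    print(source, target)
```

A diacritization model is then expected to map the stripped string back to the fully marked one. A real cleaning and normalization pipeline would likely handle many more cases (mixed scripts, punctuation, inconsistent mark ordering) than this sketch does.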