SHAMI-MT: A Syrian Arabic Dialect to Modern Standard Arabic Bidirectional Machine Translation System
August 4, 2025
Authors: Serry Sibaee, Omer Nacar, Yasser Al-Habashi, Adel Ammar, Wadii Boulila
cs.AI
Abstract
The rich linguistic landscape of the Arab world is characterized by a
significant gap between Modern Standard Arabic (MSA), the language of formal
communication, and the diverse regional dialects used in everyday life. This
diglossia presents a formidable challenge for natural language processing,
particularly machine translation. This paper introduces SHAMI-MT, a
bidirectional machine translation system specifically engineered to bridge the
communication gap between MSA and the Syrian dialect (Shami). We present two
specialized models, one for MSA-to-Shami and another for Shami-to-MSA
translation, both built upon the state-of-the-art AraT5v2-base-1024
architecture. The models were fine-tuned on the comprehensive Nabra dataset and
rigorously evaluated on unseen data from the MADAR corpus. Our MSA-to-Shami
model achieved an outstanding average quality score of 4.01 out of 5.0
when judged by OpenAI's GPT-4.1 model, demonstrating its ability to produce
translations that are not only accurate but also dialectally authentic. This
work provides a crucial, high-fidelity tool for a previously underserved
language pair, advancing the field of dialectal Arabic translation and offering
significant applications in content localization, cultural heritage, and
intercultural communication.