Advancing Arabic Reverse Dictionary Systems: A Transformer-Based Approach with Dataset Construction Guidelines
April 30, 2025
Authors: Serry Sibaee, Samar Ahmed, Abdullah Al Harbi, Omer Nacar, Adel Ammar, Yasser Habashi, Wadii Boulila
cs.AI
Abstract
This study addresses a critical gap in Arabic natural language processing
by developing an effective Arabic Reverse Dictionary (RD) system that enables
users to find words based on their descriptions or meanings. We present a novel
transformer-based approach with a semi-encoder neural network architecture
featuring geometrically decreasing layers that achieves state-of-the-art
results for Arabic RD tasks. Our methodology incorporates a comprehensive
dataset construction process and establishes formal quality standards for
Arabic lexicographic definitions. Experiments with various pre-trained models
demonstrate that Arabic-specific models significantly outperform general
multilingual embeddings, with ARBERTv2 achieving the best ranking score
(0.0644). Additionally, we provide a formal abstraction of the reverse
dictionary task that enhances theoretical understanding and develop a modular,
extensible Python library (RDTL) with configurable training pipelines. Our
analysis of dataset quality reveals important insights for improving Arabic
definition construction, leading to eight specific standards for building
high-quality reverse dictionary resources. This work contributes significantly
to Arabic computational linguistics and provides valuable tools for language
learning, academic writing, and professional communication in Arabic.
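
The abstract does not give layer sizes for the semi-encoder, so the following is only a minimal sketch of what "geometrically decreasing layers" could mean: each layer's width is the previous width multiplied by a constant ratio. The function name `geometric_layer_widths`, the 768-dimensional input, the 96-dimensional output, and the layer count are all assumptions chosen for illustration. They are not taken from the paper or from the RDTL library.

```python
def geometric_layer_widths(input_dim, output_dim, num_layers):
    """Return num_layers + 1 widths that shrink from input_dim to output_dim
    by a constant ratio (a geometric progression).

    Hypothetical helper for illustration only; not part of RDTL.
    """
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    if input_dim <= 0 or output_dim <= 0:
        raise ValueError("dimensions must be positive")

    # Constant shrink factor applied at every layer.
    ratio = (output_dim / input_dim) ** (1.0 / num_layers)

    widths = [round(input_dim * ratio ** i) for i in range(num_layers + 1)]
    widths[-1] = output_dim  # guard against floating-point rounding drift
    return widths


# Assumed example: shrink a 768-dimensional input to 96 dimensions in 3 steps.
widths = geometric_layer_widths(768, 96, 3)
print(widths)
```

Under these assumed values the shrink ratio is 0.5, so each layer is half as wide as the one before it. In a real model, consecutive pairs of widths would set the input and output sizes of successive projection layers.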