Enhancing Semantic Similarity Understanding in Arabic NLP with Nested Embedding Learning
July 30, 2024
Authors: Omer Nacar, Anis Koubaa
cs.AI
Abstract
This work presents a novel framework for training Arabic nested embedding
models through Matryoshka Embedding Learning, leveraging multilingual,
Arabic-specific, and English-based models to highlight the power of nested
embedding models in various Arabic NLP downstream tasks. Our innovative
contribution includes the translation of various sentence similarity datasets
into Arabic, enabling a comprehensive evaluation framework to compare these
models across different dimensions. We trained several nested embedding models
on the Arabic Natural Language Inference triplet dataset and assessed their
performance using multiple evaluation metrics, including Pearson and Spearman
correlations for cosine similarity, Manhattan distance, Euclidean distance, and
dot product similarity. The results demonstrate the superior performance of the
Arabic Matryoshka embedding models, particularly in capturing semantic nuances
unique to the Arabic language, outperforming traditional models by up to 20-25%
across the various similarity metrics. These results underscore the
effectiveness of language-specific training and highlight the potential of
Matryoshka models in enhancing semantic textual similarity tasks for Arabic NLP.
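As a rough illustration of the pipeline described in the abstract, the sketch below trains a nested embedding model with the sentence-transformers library's MatryoshkaLoss and then scores a truncated embedding size with Spearman correlation of cosine similarities. The base checkpoint, the toy Arabic triplets, the truncation dimension, and the gold similarity labels are illustrative assumptions, not the authors' released code or data.

```python
# Minimal sketch (assumptions, not the paper's released code): Matryoshka
# Embedding Learning on Arabic (anchor, positive, negative) triplets, followed
# by a truncated-dimension evaluation with Spearman correlation of cosine scores.
from torch.utils.data import DataLoader
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer, InputExample, losses, util

# Any multilingual or Arabic-specific base model could be plugged in here.
model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")

# Hypothetical triplets standing in for the Arabic NLI triplet dataset.
train_examples = [
    InputExample(texts=["الطقس جميل اليوم", "الجو رائع اليوم", "أنا أحب البرمجة"]),
    InputExample(texts=["القطة تنام على الأريكة", "قطة نائمة فوق الأريكة", "السيارة حمراء"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# Matryoshka learning wraps a base contrastive loss so that the leading
# sub-vectors (e.g. the first 64 or 128 dimensions) are also trained to be
# useful embeddings on their own.
base_loss = losses.MultipleNegativesRankingLoss(model)
train_loss = losses.MatryoshkaLoss(model, base_loss, matryoshka_dims=[768, 512, 256, 128, 64])

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=0)

# Evaluation at a truncated size: keep only the first 256 dimensions and compare
# cosine similarities of sentence pairs against hypothetical gold STS labels.
sents1 = ["الكتاب على الطاولة", "الولد يلعب كرة القدم", "الطقس بارد جداً"]
sents2 = ["الكتاب فوق الطاولة", "طفل يلعب بالكرة", "أحب تناول التفاح"]
gold = [0.95, 0.80, 0.05]

emb1 = model.encode(sents1, convert_to_tensor=True)[:, :256]
emb2 = model.encode(sents2, convert_to_tensor=True)[:, :256]
cos = util.cos_sim(emb1, emb2).diagonal().cpu().numpy()
print("Spearman correlation at 256 dims:", spearmanr(cos, gold).correlation)
```

The same evaluation loop could be repeated at each Matryoshka dimension (64, 128, 256, 512, 768) to reproduce the kind of per-dimension comparison the abstract describes.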