Enhancing Semantic Similarity Understanding in Arabic NLP with Nested Embedding Learning

July 30, 2024
Authors: Omer Nacar, Anis Koubaa
cs.AI

Abstract

This work presents a novel framework for training Arabic nested embedding models through Matryoshka Embedding Learning, leveraging multilingual, Arabic-specific, and English-based models to highlight the power of nested embedding models in various Arabic NLP downstream tasks. Our contribution includes the translation of various sentence similarity datasets into Arabic, enabling a comprehensive evaluation framework to compare these models across different dimensions. We trained several nested embedding models on the Arabic Natural Language Inference triplet dataset and assessed their performance using multiple evaluation metrics, including Pearson and Spearman correlations for cosine similarity, Manhattan distance, Euclidean distance, and dot product similarity. The results show that Arabic Matryoshka embedding models are better at capturing semantic nuances unique to the Arabic language, outperforming traditional models by up to 20-25% across various similarity metrics. These results underscore the effectiveness of language-specific training and highlight the potential of Matryoshka models in enhancing semantic textual similarity tasks for Arabic NLP.
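
The evaluation named in the abstract can be illustrated with a minimal sketch. It assumes that a nested (Matryoshka) embedding is used at a smaller size by keeping its first `dim` components and re-normalizing, and that distance metrics are negated so that larger always means more similar. All function names and the toy vectors below are hypothetical and are not taken from the paper; the snippet only shows how Pearson and Spearman correlations against gold similarity scores might be computed at a chosen dimension.

```python
import math

def truncate(vec, dim):
    """Keep the first `dim` components and L2-normalize them (assumed nested-prefix usage)."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else list(head)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    denom = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / denom if denom > 0 else 0.0

def neg_manhattan(a, b):
    # Negated so that a higher score means "more similar".
    return -sum(abs(x - y) for x, y in zip(a, b))

def neg_euclidean(a, b):
    return -math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

SCORERS = {
    "cosine": cosine,
    "manhattan": neg_manhattan,
    "euclidean": neg_euclidean,
    "dot": dot,
}

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out

def spearman(xs, ys):
    """Spearman correlation computed as Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

def evaluate(pairs, gold, dim):
    """Return {metric: (pearson, spearman)} for embeddings truncated to `dim`."""
    results = {}
    for name, scorer in SCORERS.items():
        scores = [scorer(truncate(a, dim), truncate(b, dim)) for a, b in pairs]
        results[name] = (pearson(scores, gold), spearman(scores, gold))
    return results

# Toy data, invented for illustration only: 4-dimensional "embeddings" of
# sentence pairs and made-up gold similarity scores.
TOY_PAIRS = [
    ([0.9, 0.1, 0.3, 0.2], [0.8, 0.2, 0.1, 0.4]),
    ([0.1, 0.9, 0.5, 0.1], [0.7, 0.3, 0.2, 0.6]),
    ([0.2, 0.8, 0.1, 0.9], [0.9, 0.1, 0.8, 0.1]),
]
TOY_GOLD = [4.5, 2.0, 0.5]

if __name__ == "__main__":
    for dim in (4, 2):
        print(dim, evaluate(TOY_PAIRS, TOY_GOLD, dim))
```

Running `evaluate` at several values of `dim` gives one row of correlations per dimension, which is the kind of cross-dimension comparison the abstract describes. A real evaluation would use model-produced embeddings and a translated similarity benchmark rather than these toy values.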
