Enhancing Semantic Similarity Understanding in Arabic NLP with Nested Embedding Learning
July 30, 2024
Authors: Omer Nacar, Anis Koubaa
cs.AI
Abstract
This work presents a novel framework for training Arabic nested embedding
models through Matryoshka Embedding Learning, leveraging multilingual,
Arabic-specific, and English-based models to highlight the power of nested
embedding models in various Arabic NLP downstream tasks. Our contribution
includes the translation of various sentence similarity datasets into Arabic,
enabling a comprehensive evaluation framework to compare these models across
different dimensions. We trained several nested embedding models on the Arabic
Natural Language Inference triplet dataset and assessed their performance
using multiple evaluation metrics, including Pearson and Spearman correlations
for cosine similarity, Manhattan distance, Euclidean distance, and dot product
similarity. The results show that the Arabic Matryoshka embedding models are
better at capturing semantic nuances unique to the Arabic language,
significantly outperforming traditional models by up to 20-25% across various
similarity metrics. These results underscore the effectiveness of
language-specific training and highlight the potential of Matryoshka models in
enhancing semantic textual similarity tasks for Arabic NLP.
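
To make the evaluation described above concrete, here is a minimal sketch of how nested (Matryoshka) embeddings could be scored at several truncation sizes. It is not the authors' evaluation code. The dimension list, the toy data, and all function names (`truncate`, `pair_scores`, `evaluate`) are assumptions chosen for illustration, and the Spearman helper assumes there are no tied values.

```python
import numpy as np

def truncate(emb, dim):
    """Keep the first `dim` coordinates of each embedding and re-normalize."""
    cut = emb[:, :dim]
    return cut / np.linalg.norm(cut, axis=1, keepdims=True)

def pair_scores(a, b):
    """Four scores per sentence pair; distances are negated so higher = more similar."""
    dot = np.sum(a * b, axis=1)
    return {
        "cosine": dot / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)),
        "manhattan": -np.sum(np.abs(a - b), axis=1),
        "euclidean": -np.linalg.norm(a - b, axis=1),
        "dot": dot,
    }

def pearson(x, y):
    return float(np.corrcoef(x, y)[0, 1])

def spearman(x, y):
    """Pearson correlation of ranks (assumes no ties in x or y)."""
    def rank(v):
        return np.argsort(np.argsort(v))
    return pearson(rank(x), rank(y))

def evaluate(emb_a, emb_b, gold, dims=(768, 512, 256, 128, 64)):
    """Return {dim: {metric: (pearson, spearman)}} against gold similarity labels."""
    report = {}
    for d in dims:  # hypothetical nested sizes, not taken from the paper
        a, b = truncate(emb_a, d), truncate(emb_b, d)
        report[d] = {
            name: (pearson(s, gold), spearman(s, gold))
            for name, s in pair_scores(a, b).items()
        }
    return report

# Toy data standing in for real sentence embeddings and human similarity labels.
rng = np.random.default_rng(0)
emb_a = rng.normal(size=(50, 768))
noise = rng.normal(size=(50, 768))
mix = np.linspace(0.0, 1.0, 50)[:, None]
emb_b = (1 - mix) * emb_a + mix * noise  # later pairs are progressively less similar
gold = 5.0 * (1 - mix[:, 0])             # made-up gold scores on a 0-5 scale

report = evaluate(emb_a, emb_b, gold)
for d, metrics in report.items():
    print(d, {k: round(v[1], 3) for k, v in metrics.items()})  # Spearman per metric
```

Because each truncated vector is re-normalized, the cosine and dot-product scores coincide in this sketch; they would differ if the embeddings were left unnormalized.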