

GATE: General Arabic Text Embedding for Enhanced Semantic Textual Similarity with Matryoshka Representation Learning and Hybrid Loss Training

May 30, 2025
Authors: Omer Nacar, Anis Koubaa, Serry Sibaee, Yasser Al-Habashi, Adel Ammar, Wadii Boulila
cs.AI

Abstract

Semantic textual similarity (STS) is a critical task in natural language processing (NLP), enabling applications in retrieval, clustering, and understanding semantic relationships between texts. However, research in this area for the Arabic language remains limited due to the lack of high-quality datasets and pre-trained models. This scarcity of resources has restricted the accurate evaluation and advancement of semantic similarity for Arabic text. This paper introduces General Arabic Text Embedding (GATE) models that achieve state-of-the-art performance on the Semantic Textual Similarity task within the MTEB benchmark. GATE leverages Matryoshka Representation Learning and a hybrid loss training approach with Arabic triplet datasets for Natural Language Inference, which are essential for enhancing model performance in tasks that demand fine-grained semantic understanding. GATE outperforms larger models, including OpenAI's, with a 20-25% performance improvement on STS benchmarks, effectively capturing the unique semantic nuances of Arabic.
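The core idea behind Matryoshka Representation Learning is that prefixes of a single trained embedding vector act as valid lower-dimensional embeddings, so similarity can be computed at several nested dimensionalities from one encoding pass. A minimal sketch of that inference-time property (using random vectors in place of real GATE model outputs, and illustrative dimensions of 768/256/64; the paper does not specify these here):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-ins for sentence embeddings produced by a Matryoshka-trained model.
rng = np.random.default_rng(0)
full_dim = 768
emb_a = rng.normal(size=full_dim)
emb_b = rng.normal(size=full_dim)

# Matryoshka-style usage: truncate the same vector to nested prefix sizes,
# trading accuracy for storage/compute without re-encoding the text.
for dim in (768, 256, 64):
    sim = cosine(emb_a[:dim], emb_b[:dim])
    print(f"dim={dim}: similarity={sim:.3f}")
```

In a real pipeline the embeddings would come from the GATE encoder; training with a Matryoshka loss is what makes these truncated prefixes remain meaningful rather than arbitrary slices.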

