ChatPaper.ai

GATE: General Arabic Text Embedding for Enhanced Semantic Textual Similarity with Matryoshka Representation Learning and Hybrid Loss Training

May 30, 2025
作者: Omer Nacar, Anis Koubaa, Serry Sibaee, Yasser Al-Habashi, Adel Ammar, Wadii Boulila
cs.AI

Abstract

Semantic textual similarity (STS) is a critical task in natural language processing (NLP), enabling applications in retrieval, clustering, and understanding semantic relationships between texts. However, research in this area for the Arabic language remains limited due to the lack of high-quality datasets and pre-trained models. This scarcity of resources has restricted accurate evaluation and advancement of semantic similarity for Arabic text. This paper introduces General Arabic Text Embedding (GATE) models that achieve state-of-the-art performance on the Semantic Textual Similarity task within the MTEB benchmark. GATE leverages Matryoshka Representation Learning and a hybrid loss training approach with Arabic triplet datasets for Natural Language Inference, which are essential for enhancing model performance in tasks that demand fine-grained semantic understanding. GATE outperforms larger models, including OpenAI's embedding models, with a 20-25% performance improvement on STS benchmarks, effectively capturing the unique semantic nuances of Arabic.
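The abstract names two training ingredients: Matryoshka Representation Learning, where prefixes of a full embedding are themselves usable lower-dimensional embeddings, and triplet-based training for fine-grained similarity. The sketch below illustrates both ideas in miniature; it is not the authors' implementation, and all dimensions, noise levels, and function names are illustrative assumptions.

```python
import math
import random

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def truncate(embedding, dim):
    # Matryoshka idea: the first `dim` components of a full embedding
    # form a smaller embedding that is still usable on its own.
    return embedding[:dim]

def triplet_loss(anchor, positive, negative, margin=0.5):
    # Hinge-style triplet objective over cosine similarity:
    # push sim(anchor, positive) above sim(anchor, negative) by `margin`.
    return max(0.0, cosine(anchor, negative) - cosine(anchor, positive) + margin)

random.seed(0)
full_dim = 64  # illustrative; real models use e.g. 768 dimensions
anchor = [random.gauss(0, 1) for _ in range(full_dim)]
positive = [a + random.gauss(0, 0.1) for a in anchor]     # stand-in for an entailed sentence
negative = [random.gauss(0, 1) for _ in range(full_dim)]  # stand-in for an unrelated sentence

# Evaluate the triplet objective at nested Matryoshka dimensions.
for dim in (64, 32, 16, 8):
    a, p, n = (truncate(v, dim) for v in (anchor, positive, negative))
    print(dim, round(triplet_loss(a, p, n), 3))
```

In Matryoshka training, a loss like this would be summed over the nested dimensions so that every prefix of the embedding stays discriminative, letting downstream users trade accuracy for storage by simply truncating vectors.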