GATE: General Arabic Text Embedding for Enhanced Semantic Textual Similarity with Matryoshka Representation Learning and Hybrid Loss Training

Abstract

Semantic textual similarity (STS) is a core task in natural language processing (NLP) that underpins applications such as retrieval, clustering, and the analysis of semantic relationships between texts. Research on STS for Arabic remains limited, however, because high-quality datasets and pre-trained models are scarce, and this scarcity has held back both the accurate evaluation and the advancement of semantic similarity methods for Arabic text. This paper introduces the General Arabic Text Embedding (GATE) models, which achieve state-of-the-art performance on the Semantic Textual Similarity task within the MTEB benchmark. GATE combines Matryoshka Representation Learning with a hybrid loss training approach that uses Arabic triplet datasets for Natural Language Inference, which are essential for improving performance on tasks that demand fine-grained semantic understanding. GATE outperforms larger models, including OpenAI's, with a 20-25% performance improvement on STS benchmarks, effectively capturing the unique semantic nuances of Arabic.
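The abstract does not spell out the loss formulation, so the following is only a minimal sketch of how a Matryoshka-style objective over triplets could look. It assumes a cosine-distance triplet hinge loss averaged over nested prefixes of the embedding; the function names, the prefix sizes in `dims`, and the `margin` value are all hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge loss that pushes the positive closer to the anchor than the negative."""
    d_pos = 1.0 - cosine(anchor, positive)
    d_neg = 1.0 - cosine(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


def matryoshka_triplet_loss(anchor, positive, negative, dims=(2, 4, 8), margin=0.5):
    """Average the triplet loss over nested prefixes of the embeddings.

    Assumption: each prefix length in `dims` is weighted equally. The idea is
    that every truncated embedding should remain useful on its own.
    """
    losses = [
        triplet_loss(anchor[:d], positive[:d], negative[:d], margin)
        for d in dims
    ]
    return sum(losses) / len(losses)


# Toy 8-dimensional "embeddings" standing in for encoder outputs.
anchor = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
positive = list(anchor)               # same direction as the anchor
negative = [-x for x in anchor]       # opposite direction

well_separated = matryoshka_triplet_loss(anchor, positive, negative)
violated = matryoshka_triplet_loss(anchor, negative, positive)
```

In this toy case the loss is zero when the positive is aligned with the anchor at every prefix length, and positive when the roles are swapped. A real training setup would compute the same quantity on batched encoder outputs with automatic differentiation.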
