Multilingual E5 Text Embeddings: A Technical Report

Abstract

This technical report presents the training methodology and evaluation results of the open-source multilingual E5 text embedding models, released in mid-2023. Three embedding models of different sizes (small / base / large) are provided, offering a balance between inference efficiency and embedding quality. The training procedure adheres to the English E5 model recipe, involving contrastive pre-training on 1 billion multilingual text pairs, followed by fine-tuning on a combination of labeled datasets. Additionally, we introduce a new instruction-tuned embedding model whose performance is on par with state-of-the-art English-only models of similar size. Information regarding the model release can be found at https://github.com/microsoft/unilm/tree/master/e5.
