Multilingual E5 Text Embeddings: A Technical Report
February 8, 2024
Authors: Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, Furu Wei
cs.AI
Abstract
This technical report presents the training methodology and evaluation results of the open-source multilingual E5 text embedding models, released in mid-2023. Three embedding models of different sizes (small / base / large) are provided, offering a balance between inference efficiency and embedding quality. The training procedure adheres to the English E5 model recipe, involving contrastive pre-training on 1 billion multilingual text pairs, followed by fine-tuning on a combination of labeled datasets. Additionally, we introduce a new instruction-tuned embedding model, whose performance is on par with state-of-the-art, English-only models of similar sizes. Information regarding the model release can be found at https://github.com/microsoft/unilm/tree/master/e5.
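The sketch below shows how embeddings from these models are typically extracted. It is a minimal example, not part of this report: the checkpoint identifier intfloat/multilingual-e5-base and the "query: " / "passage: " input prefixes are assumptions drawn from the linked repository's released checkpoints, and mean pooling over non-padding tokens followed by L2 normalization is the pooling scheme the English E5 recipe uses.

```python
# Minimal embedding-extraction sketch. The model ID and input prefixes
# are assumptions based on the linked E5 repository, not this abstract.
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel

def average_pool(last_hidden: Tensor, attention_mask: Tensor) -> Tensor:
    # Mean-pool token embeddings, zeroing out padding positions first.
    hidden = last_hidden.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

# E5-style models expect a "query: " or "passage: " prefix on each input.
texts = [
    "query: how much protein should a female eat",
    "passage: The recommended daily protein intake for adult women is 46 grams.",
]

tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-base")
model = AutoModel.from_pretrained("intfloat/multilingual-e5-base")

batch = tokenizer(texts, max_length=512, padding=True,
                  truncation=True, return_tensors="pt")
outputs = model(**batch)
embeddings = F.normalize(
    average_pool(outputs.last_hidden_state, batch["attention_mask"]),
    p=2, dim=1,
)
# With L2-normalized embeddings, the dot product is cosine similarity.
score = (embeddings[0] @ embeddings[1]).item()
print(score)
```

Swapping in the small or large checkpoint trades inference cost against embedding quality, matching the small / base / large sizes described above.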