

Multilingual E5 Text Embeddings: A Technical Report

February 8, 2024
Authors: Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, Furu Wei
cs.AI

Abstract

This technical report presents the training methodology and evaluation results of the open-source multilingual E5 text embedding models, released in mid-2023. Three embedding models of different sizes (small / base / large) are provided, offering a balance between inference efficiency and embedding quality. The training procedure adheres to the English E5 model recipe, involving contrastive pre-training on 1 billion multilingual text pairs, followed by fine-tuning on a combination of labeled datasets. Additionally, we introduce a new instruction-tuned embedding model, whose performance is on par with state-of-the-art, English-only models of similar sizes. Information regarding the model release can be found at https://github.com/microsoft/unilm/tree/master/e5.
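The "contrastive pre-training on text pairs" mentioned above typically means an InfoNCE-style objective with in-batch negatives: each (query, passage) pair in a batch is a positive, and every other passage in the batch acts as a negative. The sketch below is a minimal NumPy illustration of that objective, not code from the report; the function name and the temperature value are illustrative assumptions.

```python
import numpy as np

def info_nce_loss(query_emb, passage_emb, temperature=0.05):
    """InfoNCE contrastive loss with in-batch negatives (illustrative sketch).

    query_emb, passage_emb: (batch, dim) arrays where row i of each
    forms a positive pair; all other rows serve as negatives.
    """
    # L2-normalize so the dot product becomes cosine similarity
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    p = passage_emb / np.linalg.norm(passage_emb, axis=1, keepdims=True)
    logits = (q @ p.T) / temperature  # (batch, batch) similarity matrix

    # Softmax cross-entropy where the diagonal entry is the correct class
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))
```

A lower loss means queries sit closer to their paired passages than to the other passages in the batch; when every query exactly matches its passage, the loss approaches zero.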