mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval
July 29, 2024
Authors: Xin Zhang, Yanzhao Zhang, Dingkun Long, Wen Xie, Ziqi Dai, Jialong Tang, Huan Lin, Baosong Yang, Pengjun Xie, Fei Huang, Meishan Zhang, Wenjie Li, Min Zhang
cs.AI
Abstract
We present systematic efforts in building a long-context multilingual text
representation model (TRM) and reranker from scratch for text retrieval. We
first introduce a text encoder (base size) enhanced with RoPE and unpadding,
pre-trained with a native 8192-token context (longer than the 512 tokens of
previous multilingual encoders). We then construct a hybrid TRM and a
cross-encoder reranker by contrastive learning. Evaluations show that our text
encoder outperforms XLM-R, the previous state of the art at the same size.
Meanwhile, our TRM and reranker match the performance of the larger
state-of-the-art BGE-M3 models and achieve better results on long-context
retrieval benchmarks. Further analysis demonstrates that our proposed models
exhibit higher efficiency during both training and inference. We believe their
efficiency and effectiveness could benefit a range of research and industrial
applications.
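The abstract attributes the native 8192-token context to an encoder "enhanced with RoPE and unpadding". As background for readers unfamiliar with rotary position embeddings, below is a minimal, self-contained PyTorch sketch of the core RoPE operation; the function name, tensor layout, and base frequency are illustrative assumptions, not the paper's implementation:

```python
import torch

def rotary_embed(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embeddings (RoPE) to a tensor of shape
    (seq_len, dim). Each channel pair is rotated by an angle proportional
    to the token position, so attention scores depend on relative offsets."""
    seq_len, dim = x.shape
    half = dim // 2
    # Per-pair rotation frequencies: theta_i = base^(-2i / dim)
    freqs = base ** (-torch.arange(half, dtype=torch.float32) * 2 / dim)
    # Angle for token position t and channel pair i is t * theta_i
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    # 2-D rotation applied to each (x1, x2) channel pair
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Usage: rotate hidden states for a hypothetical 8192-token sequence.
hidden = torch.randn(8192, 768)
rotated = rotary_embed(hidden)
```

Because RoPE encodes positions as relative rotations rather than learned absolute embeddings, it is a common choice when pre-training encoders at long context lengths such as the 8192 tokens reported here.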
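The TRM and reranker are trained "by contrastive learning"; the abstract does not spell out the objective, but a standard formulation for dense retrievers is InfoNCE with in-batch negatives, sketched below (the temperature value and function name are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q: torch.Tensor, p: torch.Tensor, tau: float = 0.05) -> torch.Tensor:
    """InfoNCE with in-batch negatives: q and p are query and
    positive-passage embeddings of shape (batch, dim). Each query's
    positive is the same-index row of p; every other row in the batch
    serves as a negative."""
    q = F.normalize(q, dim=-1)
    p = F.normalize(p, dim=-1)
    logits = q @ p.T / tau  # (batch, batch) cosine-similarity matrix
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)
```

The same objective can supervise a cross-encoder reranker by scoring each (query, passage) pair jointly instead of comparing independent embeddings.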