
The Russian-focused embedders' exploration: ruMTEB benchmark and Russian embedding model design

August 22, 2024
Authors: Artem Snegirev, Maria Tikhonova, Anna Maksimova, Alena Fenogenova, Alexander Abramov
cs.AI

Abstract

Embedding models play a crucial role in Natural Language Processing (NLP) by creating text embeddings used in various tasks such as information retrieval and assessing semantic text similarity. This paper focuses on research related to embedding models in the Russian language. It introduces a new Russian-focused embedding model called ru-en-RoSBERTa and the ruMTEB benchmark, the Russian version extending the Massive Text Embedding Benchmark (MTEB). Our benchmark includes seven categories of tasks, such as semantic textual similarity, text classification, reranking, and retrieval. The research also assesses a representative set of Russian and multilingual models on the proposed benchmark. The findings indicate that the new model achieves results that are on par with state-of-the-art models in Russian. We release the model ru-en-RoSBERTa, and the ruMTEB framework comes with open-source code, integration into the original framework and a public leaderboard.
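
As a rough illustration of how text embeddings support tasks such as semantic textual similarity and retrieval, the sketch below ranks documents by cosine similarity to a query. It is not the authors' code and does not use ru-en-RoSBERTa or the ruMTEB framework: the three-dimensional vectors are made-up stand-ins for real embeddings, and the function names `cosine_similarity` and `rank_documents` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return (document index, score) pairs, most similar first."""
    scores = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for embeddings produced by a real model (assumption).
query = [1.0, 0.0, 1.0]
docs = [
    [0.9, 0.1, 1.1],  # points in nearly the same direction as the query
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [1.0, 1.0, 0.0],  # partially overlapping
]

ranking = rank_documents(query, docs)
for index, score in ranking:
    print(f"doc {index}: {score:.3f}")
```

In an actual evaluation the vectors would come from an embedding model, and a benchmark would compare rankings or similarity scores like these against human-labelled data.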
