The Russian-focused embedders' exploration: ruMTEB benchmark and Russian embedding model design

August 22, 2024
Authors: Artem Snegirev, Maria Tikhonova, Anna Maksimova, Alena Fenogenova, Alexander Abramov
cs.AI

Abstract

Embedding models play a crucial role in Natural Language Processing (NLP): they produce text embeddings used in tasks such as information retrieval and semantic textual similarity assessment. This paper focuses on embedding models for the Russian language. It introduces a new Russian-focused embedding model, ru-en-RoSBERTa, and the ruMTEB benchmark, a Russian extension of the Massive Text Embedding Benchmark (MTEB). The benchmark covers seven task categories, including semantic textual similarity, text classification, reranking, and retrieval. The paper also evaluates a representative set of Russian and multilingual models on the proposed benchmark. The results indicate that the new model performs on par with state-of-the-art models in Russian. We release the ru-en-RoSBERTa model together with the ruMTEB framework, which comes with open-source code, integration into the original MTEB framework, and a public leaderboard.
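
To make the similarity and retrieval tasks mentioned in the abstract concrete, here is a minimal sketch of how text embeddings are commonly compared. It assumes cosine similarity as the scoring function and uses hand-made 3-dimensional toy vectors; the function names, document ids, and vector values are hypothetical and are not taken from the paper, ru-en-RoSBERTa, or the ruMTEB code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return document ids sorted from most to least similar to the query."""
    scores = {
        doc_id: cosine_similarity(query_vec, vec)
        for doc_id, vec in doc_vecs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy vectors standing in for real embeddings (assumption: a real model
# would produce vectors with hundreds of dimensions).
query = [1.0, 0.0, 1.0]
docs = {
    "doc_a": [0.9, 0.1, 0.8],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [0.5, 0.5, 0.5],
}

ranking = rank_documents(query, docs)
print(ranking)
```

In a real setting the vectors would come from an embedding model rather than being written by hand; the ranking step stays the same, which is why a retrieval task can compare different embedding models under an identical scoring procedure.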
