MMTEB: Massive Multilingual Text Embedding Benchmark
February 19, 2025
Authors: Kenneth Enevoldsen, Isaac Chung, Imene Kerboua, Márton Kardos, Ashwin Mathur, David Stap, Jay Gala, Wissam Siblini, Dominik Krzemiński, Genta Indra Winata, Saba Sturua, Saiteja Utpala, Mathieu Ciancone, Marion Schaeffer, Gabriel Sequeira, Diganta Misra, Shreeya Dhakal, Jonathan Rystrøm, Roman Solomatin, Ömer Çağatan, Akash Kundu, Martin Bernstorff, Shitao Xiao, Akshita Sukhlecha, Bhavish Pahwa, Rafał Poświata, Kranthi Kiran GV, Shawon Ashraf, Daniel Auras, Björn Plüster, Jan Philipp Harries, Loïc Magne, Isabelle Mohr, Mariya Hendriksen, Dawei Zhu, Hippolyte Gisserot-Boukhlef, Tom Aarsen, Jan Kostkan, Konrad Wojtasik, Taemin Lee, Marek Šuppa, Crystina Zhang, Roberta Rocca, Mohammed Hamdy, Andrianos Michail, John Yang, Manuel Faysse, Aleksei Vatolin, Nandan Thakur, Manan Dey, Dipam Vasani, Pranjal Chitale, Simone Tedeschi, Nguyen Tai, Artem Snegirev, Michael Günther, Mengzhou Xia, Weijia Shi, Xing Han Lù, Jordan Clive, Gayatri Krishnakumar, Anna Maksimova, Silvan Wehrli, Maria Tikhonova, Henil Panchal, Aleksandr Abramov, Malte Ostendorff, Zheng Liu, Simon Clematide, Lester James Miranda, Alena Fenogenova, Guangyu Song, Ruqiya Bin Safi, Wen-Ding Li, Alessia Borghini, Federico Cassano, Hongjin Su, Jimmy Lin, Howard Yen, Lasse Hansen, Sara Hooker, Chenghao Xiao, Vaibhav Adlakha, Orion Weller, Siva Reddy, Niklas Muennighoff
cs.AI
Abstract
Text embeddings are typically evaluated on a limited set of tasks, which are
constrained by language, domain, and task diversity. To address these
limitations and provide a more comprehensive evaluation, we introduce the
Massive Multilingual Text Embedding Benchmark (MMTEB) - a large-scale,
community-driven expansion of MTEB, covering over 500 quality-controlled
evaluation tasks across 250+ languages. MMTEB includes a diverse set of
challenging, novel tasks such as instruction following, long-document
retrieval, and code retrieval, representing the largest multilingual collection
of evaluation tasks for embedding models to date. Using this collection, we
develop several highly multilingual benchmarks, which we use to evaluate a
representative set of models. We find that while large language models (LLMs)
with billions of parameters can achieve state-of-the-art performance on certain
language subsets and task categories, the best-performing publicly available
model is multilingual-e5-large-instruct with only 560 million parameters. To
facilitate accessibility and reduce computational cost, we introduce a novel
downsampling method based on inter-task correlation, ensuring a diverse
selection while preserving relative model rankings. Furthermore, we optimize
tasks such as retrieval by sampling hard negatives, creating smaller but
effective splits. These optimizations allow us to introduce benchmarks that
drastically reduce computational demands. For instance, our newly introduced
zero-shot English benchmark maintains a ranking order similar to the full-scale
version but at a fraction of the computational cost.
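The MMTEB tasks are distributed through the open-source `mteb` library. As a rough illustration of how a model such as multilingual-e5-large-instruct could be scored on a handful of tasks, the following is a minimal sketch assuming the `mteb` and `sentence-transformers` Python packages; the task names and output folder are illustrative choices, not an official benchmark configuration from the paper.

```python
# Minimal sketch: evaluate an embedding model on a few MTEB/MMTEB tasks.
# Assumes `pip install mteb sentence-transformers`; task selection is illustrative.
import mteb
from sentence_transformers import SentenceTransformer

# The best-performing publicly available model reported in the abstract.
model = SentenceTransformer("intfloat/multilingual-e5-large-instruct")

# Select a small subset of tasks by name; MMTEB itself spans 500+ tasks.
tasks = mteb.get_tasks(tasks=["STS22", "MassiveIntentClassification"])

evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results")
```

For a full benchmark run, one would evaluate against an entire MMTEB benchmark definition rather than a hand-picked task list; the sketch above only shows the general evaluation flow.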