
BM25S: Orders of magnitude faster lexical search via eager sparse scoring

July 4, 2024
Author: Xing Han Lù
cs.AI

Abstract

We introduce BM25S, an efficient Python-based implementation of BM25 that only depends on NumPy and SciPy. BM25S achieves up to a 500x speedup compared to the most popular Python-based framework by eagerly computing BM25 scores during indexing and storing them into sparse matrices. It also achieves considerable speedups compared to highly optimized Java-based implementations, which are used by popular commercial products. Finally, BM25S reproduces the exact implementation of five BM25 variants based on Kamphuis et al. (2020) by extending eager scoring to non-sparse variants using a novel score shifting method. The code can be found at https://github.com/xhluca/bm25s
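
The idea of eager sparse scoring described in the abstract can be illustrated with a minimal sketch. This is not the BM25S library's API or the author's code: the function names `build_index` and `search`, the default `k1` and `b` values, and the particular BM25 formula (a Lucene-style IDF and term-frequency normalization) are assumptions chosen for illustration. The point it shows is that every per-term, per-document score is computed once at indexing time and stored in a SciPy sparse matrix, so that a query reduces to selecting columns and summing them.

```python
import math
from collections import Counter

import numpy as np
from scipy.sparse import csc_matrix


def build_index(corpus_tokens, k1=1.5, b=0.75):
    """Eagerly compute one BM25 score per (document, term) pair that occurs.

    Illustrative sketch only; names and defaults are assumptions.
    """
    # Assign an integer column id to every distinct token.
    vocab = {}
    for doc in corpus_tokens:
        for tok in doc:
            vocab.setdefault(tok, len(vocab))

    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: number of documents containing each token.
    doc_freq = Counter(tok for doc in corpus_tokens for tok in set(doc))

    rows, cols, vals = [], [], []
    for doc_id, doc in enumerate(corpus_tokens):
        length_norm = k1 * (1 - b + b * len(doc) / avg_len)
        for tok, tf in Counter(doc).items():
            df = doc_freq[tok]
            idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
            rows.append(doc_id)
            cols.append(vocab[tok])
            vals.append(idf * tf / (tf + length_norm))

    # Column-oriented storage makes selecting a query term's scores cheap.
    scores = csc_matrix((vals, (rows, cols)), shape=(n_docs, len(vocab)))
    return vocab, scores


def search(query_tokens, vocab, scores, k=2):
    """Sum the precomputed columns for the query terms and return the top k."""
    term_ids = [vocab[t] for t in query_tokens if t in vocab]
    if not term_ids:
        doc_scores = np.zeros(scores.shape[0])
    else:
        doc_scores = np.asarray(scores[:, term_ids].sum(axis=1)).ravel()
    top = np.argsort(-doc_scores)[:k]
    return top, doc_scores[top]


corpus = [["cat", "sat"], ["dog", "barked"], ["cat", "chased", "dog"]]
vocab, scores = build_index(corpus)
top_ids, top_scores = search(["cat"], vocab, scores, k=3)
print(top_ids, top_scores)
```

In this sketch a term that does not occur in a document contributes a score of zero, which is what allows the matrix to stay sparse. The abstract notes that some BM25 variants do not have this property and that BM25S handles them with a score shifting method; that method is not reproduced here.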

