BM25S: Orders of magnitude faster lexical search via eager sparse scoring
July 4, 2024
Author: Xing Han Lù
cs.AI
Abstract
We introduce BM25S, an efficient Python-based implementation of BM25 that
depends only on NumPy and SciPy. BM25S achieves up to a 500x speedup compared
to the most popular Python-based framework by eagerly computing BM25 scores
during indexing and storing them in sparse matrices. It also achieves
considerable speedups compared to highly optimized Java-based implementations,
which are used by popular commercial products. Finally, BM25S reproduces the
exact implementation of five BM25 variants based on Kamphuis et al. (2020) by
extending eager scoring to non-sparse variants using a novel score shifting
method. The code can be found at https://github.com/xhluca/bm25s.
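
The eager scoring idea described in the abstract can be illustrated with a minimal sketch. This is not the BM25S API or the author's implementation: the function names `build_index` and `search` are hypothetical, a Lucene-style BM25 formula with `k1=1.5` and `b=0.75` is assumed, and a plain dictionary of dictionaries stands in for the SciPy sparse matrix so that the example runs with the standard library alone. It shows only the core point, that all BM25 arithmetic happens once at indexing time and a query reduces to summing stored entries.

```python
import math
from collections import Counter, defaultdict


def build_index(corpus_tokens, k1=1.5, b=0.75):
    """Eagerly compute a BM25 score for every (term, document) pair that occurs.

    Returns a dict mapping term -> {doc_id: score}. Pairs that never occur are
    simply absent, which is what keeps the structure sparse.
    Assumes a non-empty corpus and a Lucene-style BM25 formula.
    """
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    term_freqs = [Counter(doc) for doc in corpus_tokens]

    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for tf in term_freqs:
        doc_freq.update(tf.keys())

    index = defaultdict(dict)
    for doc_id, tf in enumerate(term_freqs):
        # Length normalisation depends only on the document, so compute it once.
        length_norm = k1 * (1 - b + b * len(corpus_tokens[doc_id]) / avg_len)
        for term, freq in tf.items():
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            index[term][doc_id] = idf * freq / (freq + length_norm)
    return dict(index)


def search(index, query_tokens, n_docs, top_k=3):
    """Score a query by summing precomputed entries; no BM25 math at query time."""
    scores = [0.0] * n_docs
    for term in query_tokens:
        for doc_id, score in index.get(term, {}).items():
            scores[doc_id] += score
    ranked = sorted(range(n_docs), key=lambda d: scores[d], reverse=True)
    return [(d, scores[d]) for d in ranked[:top_k]]


corpus = [
    "a cat is a feline and likes to purr".split(),
    "a dog is the human's best friend and loves to play".split(),
    "a bird is a beautiful animal that can fly".split(),
]
index = build_index(corpus)
results = search(index, "does the cat purr".split(), n_docs=len(corpus))
```

Here `results` is a list of `(doc_id, score)` pairs in descending score order. In a NumPy and SciPy setting, the dictionary would plausibly be replaced by a sparse term-by-document matrix, with the query step becoming a slice over the query terms' rows or columns followed by a sum.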