Reducing the Footprint of Multi-Vector Retrieval with Minimal Performance Impact via Token Pooling

Abstract

Over the last few years, multi-vector retrieval methods, spearheaded by ColBERT, have become an increasingly popular approach to Neural IR. By storing representations at the token level rather than at the document level, these methods have demonstrated very strong retrieval performance, especially in out-of-domain settings. However, the storage and memory required to hold the large number of associated vectors remain an important drawback, hindering practical adoption. In this paper, we introduce a simple clustering-based token pooling approach to aggressively reduce the number of vectors that need to be stored. This method can reduce the space and memory footprint of ColBERT indexes by 50% with virtually no retrieval performance degradation. It also allows for further reductions, cutting the vector count by 66% to 75%, with degradation remaining below 5% on the vast majority of datasets. Importantly, this approach requires no architectural change or query-time processing, and can be used as a simple drop-in during indexing with any ColBERT-like model.
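To make the idea of clustering-based token pooling concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The abstract only says the approach is "clustering-based", so the specific choices here are assumptions made for illustration: greedy agglomerative merging on cosine similarity between cluster centroids, mean pooling within each cluster, and the names `pool_tokens` and `pool_factor`. Under the further assumption that a pool factor of k keeps roughly 1/k of the vectors, factors of 2, 3, and 4 would correspond to the 50%, 66%, and 75% reductions mentioned above.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def pool_tokens(token_vectors, pool_factor=2):
    """Reduce one document's token vectors to about len / pool_factor vectors.

    Assumption: greedy agglomerative clustering on cosine similarity between
    cluster centroids, then mean pooling. The source does not specify the
    clustering algorithm, so this is an illustrative choice only.
    """
    if pool_factor < 1:
        raise ValueError("pool_factor must be >= 1")
    vectors = [normalize(v) for v in token_vectors]
    target = max(1, math.ceil(len(vectors) / pool_factor))
    clusters = [[v] for v in vectors]  # start with one cluster per token

    while len(clusters) > target:
        centroids = [normalize(mean_vector(c)) for c in clusters]
        best_pair, best_sim = None, -math.inf
        # Find the two clusters whose centroids are most similar.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = sum(a * b for a, b in zip(centroids[i], centroids[j]))
                if sim > best_sim:
                    best_pair, best_sim = (i, j), sim
        i, j = best_pair
        clusters[i].extend(clusters[j])  # merge cluster j into cluster i
        del clusters[j]

    # One unit-length pooled vector per cluster is what gets indexed.
    return [normalize(mean_vector(c)) for c in clusters]


# Toy document: two pairs of near-duplicate token vectors.
doc = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
pooled = pool_tokens(doc, pool_factor=2)
print(len(doc), "->", len(pooled))  # 4 -> 2
```

Because the pooling runs once per document when the index is built, the query side is untouched in this sketch, which matches the abstract's claim that no query-time processing is needed.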
