Jina-ColBERT-v2: A General-Purpose Multilingual Late Interaction Retriever
August 29, 2024
Authors: Rohan Jha, Bo Wang, Michael Günther, Saba Sturua, Mohammad Kalim Akram, Han Xiao
cs.AI
Abstract
Multi-vector dense models, such as ColBERT, have proven highly effective in
information retrieval. ColBERT's late interaction scoring approximates the
joint query-document attention seen in cross-encoders while maintaining
inference efficiency closer to traditional dense retrieval models, thanks to
its bi-encoder architecture and recent optimizations in indexing and search. In
this paper, we introduce several improvements to the ColBERT model architecture
and training pipeline, leveraging techniques successful in the more established
single-vector embedding model paradigm, particularly those suited for
heterogeneous multilingual data. Our new model, Jina-ColBERT-v2, demonstrates
strong performance across a range of English and multilingual retrieval tasks,
while also cutting storage requirements by up to 50% compared to previous
models.
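
The late interaction scoring mentioned in the abstract is commonly called MaxSim in the ColBERT literature: each query token embedding is matched against its most similar document token embedding, and those maxima are summed. The following is only a minimal sketch of that idea in plain Python, assuming cosine similarity over L2-normalized token vectors. The function names (`l2_normalize`, `maxsim_score`) and the toy 2-dimensional vectors are hypothetical and are not taken from the Jina-ColBERT-v2 codebase.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def l2_normalize(vec: Vector) -> List[float]:
    """Scale a vector to unit length; an all-zero vector is returned as zeros."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return [0.0 for _ in vec]
    return [x / norm for x in vec]


def maxsim_score(query_vecs: Sequence[Vector], doc_vecs: Sequence[Vector]) -> float:
    """Late interaction score: for every query token, take the highest
    cosine similarity to any document token, then sum over query tokens."""
    if not doc_vecs:
        return 0.0
    q_norm = [l2_normalize(v) for v in query_vecs]
    d_norm = [l2_normalize(v) for v in doc_vecs]
    total = 0.0
    for q in q_norm:
        # Dot product of unit vectors equals cosine similarity.
        total += max(sum(a * b for a, b in zip(q, d)) for d in d_norm)
    return total


# Toy token embeddings (hypothetical values, for illustration only).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # covers both query tokens
doc_b = [[1.0, 0.0], [-1.0, 0.0]]             # covers only the first query token

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
```

Here `doc_a` scores higher than `doc_b` because it contains a close match for both query tokens. Since query and document tokens are encoded independently and only compared at scoring time, document vectors can be precomputed and indexed, which is the efficiency property the abstract attributes to the bi-encoder architecture.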