DyVo: Dynamic Vocabularies for Learned Sparse Retrieval with Entities
October 10, 2024
Authors: Thong Nguyen, Shubham Chatterjee, Sean MacAvaney, Iain Mackie, Jeff Dalton, Andrew Yates
cs.AI
Abstract
Learned Sparse Retrieval (LSR) models use vocabularies from pre-trained
transformers, which often split entities into nonsensical fragments. Splitting
entities can reduce retrieval accuracy and limit the model's ability to
incorporate up-to-date world knowledge not included in the training data. In
this work, we enhance the LSR vocabulary with Wikipedia concepts and entities,
enabling the model to resolve ambiguities more effectively and stay current
with evolving knowledge. Central to our approach is a Dynamic Vocabulary (DyVo)
head, which leverages existing entity embeddings and an entity retrieval
component that identifies entities relevant to a query or document. We use the
DyVo head to generate entity weights, which are then merged with word piece
weights to create joint representations for efficient indexing and retrieval
using an inverted index. In experiments across three entity-rich document
ranking datasets, the resulting DyVo model substantially outperforms
state-of-the-art baselines.
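
The merging step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the scoring rule (a dot product between a context vector and each candidate's embedding, clipped at zero), the `ent:` key prefix, the function names, and all toy vectors and weights are assumptions made for illustration. The candidate entity list stands in for the output of the entity retrieval component.

```python
from collections import defaultdict

def entity_weights(context_vec, candidates, entity_emb):
    """Weight each candidate entity by its dot product with a context vector.

    Hypothetical stand-in for a DyVo-style head; the real scoring may differ.
    """
    weights = {}
    for ent in candidates:
        score = sum(c * e for c, e in zip(context_vec, entity_emb[ent]))
        if score > 0:  # drop non-positive scores to keep the output sparse
            weights["ent:" + ent] = score
    return weights

def joint_representation(wordpiece_weights, ent_weights):
    """Merge word piece weights and entity weights into one sparse vector."""
    rep = dict(wordpiece_weights)
    rep.update(ent_weights)  # the "ent:" prefix keeps the two key spaces apart
    return rep

def build_inverted_index(docs):
    """Map each term (word piece or entity) to a list of (doc_id, weight)."""
    index = defaultdict(list)
    for doc_id, rep in docs.items():
        for term, weight in rep.items():
            index[term].append((doc_id, weight))
    return index

def search(index, query_rep):
    """Score documents by the dot product over terms shared with the query."""
    scores = defaultdict(float)
    for term, q_weight in query_rep.items():
        for doc_id, d_weight in index.get(term, []):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy data (all values invented for illustration).
entity_emb = {"Apple_Inc.": [1.0, 0.0], "Apple_(fruit)": [0.0, 1.0]}
candidates = ["Apple_Inc.", "Apple_(fruit)"]

docs = {
    "d1": joint_representation(
        {"apple": 1.0, "iphone": 0.5},
        entity_weights([1.0, 0.0], candidates, entity_emb),
    ),
    "d2": joint_representation(
        {"apple": 1.0, "pie": 0.5},
        entity_weights([0.0, 1.0], candidates, entity_emb),
    ),
}
index = build_inverted_index(docs)

query = joint_representation(
    {"apple": 1.0},
    entity_weights([1.0, 0.0], candidates, entity_emb),
)
results = search(index, query)
print(results)  # [('d1', 2.0), ('d2', 1.0)]
```

In this toy setup both documents match the word piece `apple` equally, and only the entity dimension separates the company document from the fruit document, which is the kind of disambiguation the abstract attributes to entity terms.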