Contextual Document Embeddings
October 3, 2024
Authors: John X. Morris, Alexander M. Rush
cs.AI
Abstract
Dense document embeddings are central to neural retrieval. The dominant
paradigm is to train and construct embeddings by running encoders directly on
individual documents. In this work, we argue that these embeddings, while
effective, are implicitly out-of-context for targeted use cases of retrieval,
and that a contextualized document embedding should take into account both the
document and neighboring documents in context - analogous to contextualized
word embeddings. We propose two complementary methods for contextualized
document embeddings: first, an alternative contrastive learning objective that
explicitly incorporates the document neighbors into the intra-batch contextual
loss; second, a new contextual architecture that explicitly encodes neighbor
document information into the encoded representation. Results show that both
methods achieve better performance than biencoders in several settings, with
differences especially pronounced out-of-domain. We achieve state-of-the-art
results on the MTEB benchmark with no hard negative mining, score distillation,
dataset-specific instructions, intra-GPU example-sharing, or extremely large
batch sizes. Our method can be applied to improve performance on any
contrastive learning dataset and any biencoder.
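To make the second idea, the contextual architecture, more concrete, the sketch below shows one plausible way to encode neighbor-document information into a document's representation: a first-stage encoder embeds a handful of neighboring documents from the target corpus, and a second-stage encoder consumes those neighbor embeddings alongside the document's own tokens. This is a minimal illustrative sketch, not the authors' implementation; the class name `ContextualDocumentEncoder`, all dimensions, and the mean-pooling choice are assumptions made here for brevity.

```python
# Minimal sketch (assumed details, not the paper's code) of a two-stage
# contextual document encoder: neighbor documents are embedded first, and
# their embeddings are prepended as extra context when encoding a document.

import torch
import torch.nn as nn


class ContextualDocumentEncoder(nn.Module):
    def __init__(self, vocab_size=30522, dim=256, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        # First stage: encode each neighbor document independently.
        self.neighbor_encoder = nn.TransformerEncoder(layer, n_layers)
        # Second stage: encode the document conditioned on neighbor embeddings.
        self.doc_encoder = nn.TransformerEncoder(layer, n_layers)

    def encode_neighbors(self, neighbor_tokens):
        # neighbor_tokens: (num_neighbors, seq_len) token ids from the corpus.
        hidden = self.neighbor_encoder(self.embed(neighbor_tokens))
        return hidden.mean(dim=1)  # (num_neighbors, dim) pooled embeddings

    def forward(self, doc_tokens, neighbor_embeddings):
        # Prepend neighbor embeddings to the document's token embeddings so
        # the document representation can attend to corpus context.
        doc_emb = self.embed(doc_tokens)  # (batch, seq, dim)
        context = neighbor_embeddings.unsqueeze(0).expand(doc_emb.size(0), -1, -1)
        hidden = self.doc_encoder(torch.cat([context, doc_emb], dim=1))
        return hidden.mean(dim=1)  # (batch, dim) document embeddings


# Usage sketch: embed two documents given eight neighbor documents.
model = ContextualDocumentEncoder()
neighbors = torch.randint(0, 30522, (8, 64))   # 8 neighbor docs, 64 tokens each
docs = torch.randint(0, 30522, (2, 128))       # 2 documents to embed
ctx = model.encode_neighbors(neighbors)
embeddings = model(docs, ctx)                  # (2, 256) contextual embeddings
```

A natural complement, not shown here, would be to assemble training batches from clusters of neighboring documents so that the in-batch contrastive loss is computed against realistic corpus neighbors rather than random negatives, in the spirit of the first method described in the abstract.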