Contextual Document Embeddings
October 3, 2024
Authors: John X. Morris, Alexander M. Rush
cs.AI
Abstract
Dense document embeddings are central to neural retrieval. The dominant
paradigm is to train and construct embeddings by running encoders directly on
individual documents. In this work, we argue that these embeddings, while
effective, are implicitly out-of-context for targeted use cases of retrieval,
and that a contextualized document embedding should take into account both the
document and neighboring documents in context - analogous to contextualized
word embeddings. We propose two complementary methods for contextualized
document embeddings: first, an alternative contrastive learning objective that
explicitly incorporates the document neighbors into the intra-batch contextual
loss; second, a new contextual architecture that explicitly encodes neighbor
document information into the encoded representation. Results show that both
methods achieve better performance than biencoders in several settings, with
differences especially pronounced out-of-domain. We achieve state-of-the-art
results on the MTEB benchmark with no hard negative mining, score distillation,
dataset-specific instructions, intra-GPU example-sharing, or extremely large
batch sizes. Our method can be applied to improve performance on any
contrastive learning dataset and any biencoder.
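To make the second idea, the contextual architecture, more concrete, the sketch below shows one plausible way to encode neighbor-document information into a document's representation: a first-stage encoder embeds a handful of neighboring documents from the target corpus, and a second-stage encoder consumes those neighbor embeddings alongside the document's own tokens. This is a minimal illustrative sketch, not the authors' implementation; the class name `ContextualDocumentEncoder`, all dimensions, and the mean-pooling choice are assumptions made here for brevity.

```python
# Minimal sketch (assumed details, not the paper's code) of a two-stage
# contextual document encoder: neighbor documents are embedded first, and
# their embeddings are prepended as extra context when encoding a document.

import torch
import torch.nn as nn


class ContextualDocumentEncoder(nn.Module):
    def __init__(self, vocab_size=30522, dim=256, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        # First stage: encode each neighbor document independently.
        self.neighbor_encoder = nn.TransformerEncoder(layer, n_layers)
        # Second stage: encode the document conditioned on neighbor embeddings.
        self.doc_encoder = nn.TransformerEncoder(layer, n_layers)

    def encode_neighbors(self, neighbor_tokens):
        # neighbor_tokens: (num_neighbors, seq_len) token ids from the corpus.
        hidden = self.neighbor_encoder(self.embed(neighbor_tokens))
        return hidden.mean(dim=1)  # (num_neighbors, dim) pooled embeddings

    def forward(self, doc_tokens, neighbor_embeddings):
        # Prepend neighbor embeddings to the document's token embeddings so
        # the document representation can attend to corpus context.
        doc_emb = self.embed(doc_tokens)  # (batch, seq, dim)
        context = neighbor_embeddings.unsqueeze(0).expand(doc_emb.size(0), -1, -1)
        hidden = self.doc_encoder(torch.cat([context, doc_emb], dim=1))
        return hidden.mean(dim=1)  # (batch, dim) document embeddings


# Usage sketch: embed two documents given eight neighbor documents.
model = ContextualDocumentEncoder()
neighbors = torch.randint(0, 30522, (8, 64))   # 8 neighbor docs, 64 tokens each
docs = torch.randint(0, 30522, (2, 128))       # 2 documents to embed
ctx = model.encode_neighbors(neighbors)
embeddings = model(docs, ctx)                  # (2, 256) contextual embeddings
```

A natural complement, not shown here, would be to assemble training batches from clusters of neighboring documents so that the in-batch contrastive loss is computed against realistic corpus neighbors rather than random negatives, in the spirit of the first method described in the abstract.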