Context is Gold to find the Gold Passage: Evaluating and Training Contextual Document Embeddings
May 30, 2025
Authors: Max Conti, Manuel Faysse, Gautier Viaud, Antoine Bosselut, Céline Hudelot, Pierre Colombo
cs.AI
Abstract
A limitation of modern document retrieval embedding methods is that they typically encode passages (chunks) from the same document independently, often overlooking crucial contextual information from the rest of the document that could greatly improve individual chunk representations.
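For intuition, here is a minimal sketch of late chunking pooling, the contextual pooling scheme referenced below: the whole document is encoded once, and each chunk embedding is mean-pooled from the document's contextualized token states, so every chunk "sees" the rest of the document. The model name and the substring-based span lookup are illustrative assumptions, not the paper's implementation.

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"  # illustrative choice, not the paper's backbone
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

def late_chunk_embeddings(document: str, chunks: list[str]) -> torch.Tensor:
    """Embed each chunk by pooling contextualized token states of the full document."""
    enc = tokenizer(document, return_tensors="pt",
                    return_offsets_mapping=True, truncation=True)
    offsets = enc.pop("offset_mapping")[0]          # (seq_len, 2) char span per token
    with torch.no_grad():
        token_states = model(**enc).last_hidden_state[0]  # (seq_len, dim)

    embeddings, cursor = [], 0
    for chunk in chunks:
        start = document.index(chunk, cursor)       # assumes chunks are contiguous substrings
        end = start + len(chunk)
        cursor = end
        # keep tokens whose character span overlaps this chunk's span
        mask = (offsets[:, 0] < end) & (offsets[:, 1] > start)
        embeddings.append(token_states[mask].mean(dim=0))
    return torch.stack(embeddings)                  # (num_chunks, dim)
```

In contrast, the independent-encoding baseline criticized above would call the encoder once per chunk string, discarding all cross-chunk context.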
In this work, we introduce ConTEB (Context-aware Text Embedding Benchmark), a benchmark designed to evaluate retrieval models on their ability to leverage document-wide context. Our results show that state-of-the-art embedding models struggle in retrieval scenarios where context is required. To address this limitation, we propose InSeNT (In-sequence Negative Training), a novel contrastive post-training approach that, combined with late chunking pooling, enhances contextual representation learning while preserving computational efficiency. Our method significantly improves retrieval quality on ConTEB without sacrificing base model performance. We further find that chunks embedded with our method are more robust to suboptimal chunking strategies and larger retrieval corpus sizes. We open-source all artifacts at https://github.com/illuin-tech/contextual-embeddings.
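The abstract does not spell out the InSeNT objective; the sketch below is only a plausible InfoNCE-style reading of the name "In-sequence Negative Training", in which the non-gold chunks of the same encoded document serve as additional hard negatives for a query. The function name and temperature value are assumptions; see the linked repository for the actual training code.

```python
import torch
import torch.nn.functional as F

def in_sequence_contrastive_loss(query_emb: torch.Tensor,
                                 chunk_embs: torch.Tensor,
                                 positive_idx: int,
                                 temperature: float = 0.05) -> torch.Tensor:
    """Hypothetical InfoNCE-style loss: `chunk_embs` holds all chunk
    embeddings of one document (e.g. produced by late chunking); the
    chunk at `positive_idx` is the gold passage, and its in-sequence
    siblings act as hard negatives."""
    sims = F.cosine_similarity(query_emb.unsqueeze(0), chunk_embs) / temperature  # (num_chunks,)
    target = torch.tensor([positive_idx])
    return F.cross_entropy(sims.unsqueeze(0), target)
```

Under this reading, negatives come "for free" from the same forward pass over the document, which is consistent with the abstract's claim that the method preserves computational efficiency.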