Context is Gold to find the Gold Passage: Evaluating and Training Contextual Document Embeddings

May 30, 2025
作者: Max Conti, Manuel Faysse, Gautier Viaud, Antoine Bosselut, Céline Hudelot, Pierre Colombo
cs.AI

Abstract

A limitation of modern document retrieval embedding methods is that they typically encode passages (chunks) from the same documents independently, often overlooking crucial contextual information from the rest of the document that could greatly improve individual chunk representations. In this work, we introduce ConTEB (Context-aware Text Embedding Benchmark), a benchmark designed to evaluate retrieval models on their ability to leverage document-wide context. Our results show that state-of-the-art embedding models struggle in retrieval scenarios where context is required. To address this limitation, we propose InSeNT (In-sequence Negative Training), a novel contrastive post-training approach that, combined with late chunking pooling, enhances contextual representation learning while preserving computational efficiency. Our method significantly improves retrieval quality on ConTEB without sacrificing base model performance. We further find that chunks embedded with our method are more robust to suboptimal chunking strategies and to larger retrieval corpus sizes. We open-source all artifacts at https://github.com/illuin-tech/contextual-embeddings.