Context is Gold to find the Gold Passage: Evaluating and Training Contextual Document Embeddings
May 30, 2025
Authors: Max Conti, Manuel Faysse, Gautier Viaud, Antoine Bosselut, Céline Hudelot, Pierre Colombo
cs.AI
Abstract
A limitation of modern document retrieval embedding methods is that they typically encode passages (chunks) from the same document independently, often overlooking crucial contextual information from the rest of the document that could greatly improve individual chunk representations.
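For intuition, here is a minimal sketch of late chunking pooling, the contextual pooling scheme referenced below: the whole document is encoded once, and each chunk embedding is mean-pooled from the document's contextualized token states, so every chunk "sees" the rest of the document. The model name and the substring-based span lookup are illustrative assumptions, not the paper's implementation.

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"  # illustrative choice, not the paper's backbone
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

def late_chunk_embeddings(document: str, chunks: list[str]) -> torch.Tensor:
    """Embed each chunk by pooling contextualized token states of the full document."""
    enc = tokenizer(document, return_tensors="pt",
                    return_offsets_mapping=True, truncation=True)
    offsets = enc.pop("offset_mapping")[0]          # (seq_len, 2) char span per token
    with torch.no_grad():
        token_states = model(**enc).last_hidden_state[0]  # (seq_len, dim)

    embeddings, cursor = [], 0
    for chunk in chunks:
        start = document.index(chunk, cursor)       # assumes chunks are contiguous substrings
        end = start + len(chunk)
        cursor = end
        # keep tokens whose character span overlaps this chunk's span
        mask = (offsets[:, 0] < end) & (offsets[:, 1] > start)
        embeddings.append(token_states[mask].mean(dim=0))
    return torch.stack(embeddings)                  # (num_chunks, dim)
```

In contrast, the independent-encoding baseline criticized above would call the encoder once per chunk string, discarding all cross-chunk context.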
In this work, we introduce ConTEB (Context-aware Text Embedding Benchmark), a benchmark designed to evaluate retrieval models on their ability to leverage document-wide context. Our results show that state-of-the-art embedding models struggle in retrieval scenarios where context is required. To address this limitation, we propose InSeNT (In-sequence Negative Training), a novel contrastive post-training approach that, combined with late chunking pooling, enhances contextual representation learning while preserving computational efficiency. Our method significantly improves retrieval quality on ConTEB without sacrificing base model performance. We further find that chunks embedded with our method are more robust to suboptimal chunking strategies and larger retrieval corpus sizes. We open-source all artifacts at https://github.com/illuin-tech/contextual-embeddings.
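The abstract does not spell out the InSeNT objective; the sketch below is only a plausible InfoNCE-style reading of the name "In-sequence Negative Training", in which the non-gold chunks of the same encoded document serve as additional hard negatives for a query. The function name and temperature value are assumptions; see the linked repository for the actual training code.

```python
import torch
import torch.nn.functional as F

def in_sequence_contrastive_loss(query_emb: torch.Tensor,
                                 chunk_embs: torch.Tensor,
                                 positive_idx: int,
                                 temperature: float = 0.05) -> torch.Tensor:
    """Hypothetical InfoNCE-style loss: `chunk_embs` holds all chunk
    embeddings of one document (e.g. produced by late chunking); the
    chunk at `positive_idx` is the gold passage, and its in-sequence
    siblings act as hard negatives."""
    sims = F.cosine_similarity(query_emb.unsqueeze(0), chunk_embs) / temperature  # (num_chunks,)
    target = torch.tensor([positive_idx])
    return F.cross_entropy(sims.unsqueeze(0), target)
```

Under this reading, negatives come "for free" from the same forward pass over the document, which is consistent with the abstract's claim that the method preserves computational efficiency.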