
SitEmb-v1.5: Improved Context-Aware Dense Retrieval for Semantic Association and Long Story Comprehension

August 3, 2025
作者: Junjie Wu, Jiangnan Li, Yuqing Li, Lemao Liu, Liyan Xu, Jiwei Li, Dit-Yan Yeung, Jie Zhou, Mo Yu
cs.AI

Abstract

Retrieval-augmented generation (RAG) over long documents typically involves splitting the text into smaller chunks, which serve as the basic units for retrieval. However, due to dependencies across the original document, contextual information is often essential for accurately interpreting each chunk. To address this, prior work has explored encoding longer context windows to produce embeddings for longer chunks. Despite these efforts, gains in retrieval and downstream tasks remain limited. This is because (1) longer chunks strain the capacity of embedding models due to the increased amount of information they must encode, and (2) many real-world applications still require returning localized evidence due to constraints on model or human bandwidth. We propose an alternative approach to this challenge: representing short chunks in a way that is conditioned on a broader context window to enhance retrieval performance -- i.e., situating a chunk's meaning within its context. We further show that existing embedding models are not well-equipped to encode such situated context effectively, and thus introduce a new training paradigm and develop the situated embedding models (SitEmb). To evaluate our method, we curate a book-plot retrieval dataset specifically designed to assess situated retrieval capabilities. On this benchmark, our SitEmb-v1 model based on BGE-M3, with only 1B parameters, substantially outperforms state-of-the-art embedding models, including several with up to 7-8B parameters. Our 8B SitEmb-v1.5 model further improves performance by over 10% and shows strong results across different languages and several downstream applications.
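The core idea of situated retrieval — keep the short chunk as the retrieval unit, but condition its representation on a wider context window — can be illustrated with a minimal data-preparation sketch. This is a hypothetical illustration, not the paper's actual pipeline: the chunking is naive fixed-size splitting, the function names are invented, and the real SitEmb training paradigm (how the encoder is trained to use the context) is not shown.

```python
def split_into_chunks(text: str, chunk_size: int = 200) -> list[str]:
    # Naive fixed-size character chunking; real systems would split
    # on sentence or paragraph boundaries instead.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def situate_chunks(chunks: list[str], window: int = 1) -> list[dict]:
    # Pair each short chunk with its surrounding context window.
    # The short chunk remains the unit returned to the user; the
    # broader context only conditions the chunk's embedding.
    situated = []
    for i, chunk in enumerate(chunks):
        context = " ".join(chunks[max(0, i - window): i + window + 1])
        situated.append({"chunk": chunk, "context": context})
    return situated
```

An embedding model would then encode each `(chunk, context)` pair jointly (e.g., context prepended to the chunk, with the pooled vector tied to the chunk), so that an ambiguous passage like a pronoun-heavy plot fragment is embedded with the surrounding story in view, while the retrieved evidence stays short.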