

SitEmb-v1.5: Improved Context-Aware Dense Retrieval for Semantic Association and Long Story Comprehension

August 3, 2025
Authors: Junjie Wu, Jiangnan Li, Yuqing Li, Lemao Liu, Liyan Xu, Jiwei Li, Dit-Yan Yeung, Jie Zhou, Mo Yu
cs.AI

Abstract

Retrieval-augmented generation (RAG) over long documents typically involves splitting the text into smaller chunks, which serve as the basic units for retrieval. However, because of dependencies across the original document, contextual information is often essential for accurately interpreting each chunk. To address this, prior work has explored encoding longer context windows to produce embeddings for longer chunks. Despite these efforts, gains in retrieval and downstream tasks remain limited, because (1) longer chunks strain the capacity of embedding models due to the increased amount of information they must encode, and (2) many real-world applications still require returning localized evidence due to constraints on model or human bandwidth. We propose an alternative approach: representing short chunks conditioned on a broader context window to enhance retrieval performance, i.e., situating a chunk's meaning within its context. We further show that existing embedding models are not well equipped to encode such situated context effectively, and thus introduce a new training paradigm and develop situated embedding models (SitEmb). To evaluate our method, we curate a book-plot retrieval dataset specifically designed to assess situated retrieval capabilities. On this benchmark, our SitEmb-v1 model based on BGE-M3 substantially outperforms state-of-the-art embedding models, including several with 7-8B parameters, despite having only 1B parameters itself. Our 8B SitEmb-v1.5 model further improves performance by over 10% and shows strong results across different languages and several downstream applications.
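The core idea, embedding a short chunk while conditioning on its surrounding context, can be illustrated with a minimal sketch. The snippet below is an illustration only, assuming the BGE-M3 encoder loaded through sentence-transformers; the word-based chunking, neighbor window size, and context-plus-chunk concatenation format are our own assumptions for exposition, not the paper's training recipe (SitEmb trains the encoder with a new paradigm rather than simply concatenating text at inference time).

```python
# Minimal sketch of "situated" chunk embedding (illustrative assumptions,
# not the SitEmb method itself).
from sentence_transformers import SentenceTransformer

# BGE-M3 is the base encoder behind SitEmb-v1; loading it via
# sentence-transformers is one common way to use it.
model = SentenceTransformer("BAAI/bge-m3")

def split_into_chunks(text: str, chunk_size: int = 200) -> list[str]:
    """Naive fixed-size chunking by word count; real pipelines split
    on sentence or paragraph boundaries."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

def situated_inputs(chunks: list[str], window: int = 1) -> list[str]:
    """Pair each short chunk with its neighboring chunks so the encoder
    sees broader context, while the short chunk itself remains the
    retrieval unit returned to the user."""
    inputs = []
    for i, chunk in enumerate(chunks):
        context = " ".join(chunks[max(0, i - window): i + window + 1])
        # One simple (hypothetical) conditioning scheme: concatenation.
        inputs.append(f"Context: {context}\nChunk: {chunk}")
    return inputs

document = "..."  # a long story or document
chunks = split_into_chunks(document)
situated_embeddings = model.encode(situated_inputs(chunks))  # context-aware
baseline_embeddings = model.encode(chunks)                   # context-free baseline
```

Note that each embedding still indexes a short chunk, so retrieval continues to return localized evidence; only the representation is informed by the wider window.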