Collapse of Dense Retrievers: Short, Early, and Literal Biases Outranking Factual Evidence

March 6, 2025
作者: Mohsen Fayyaz, Ali Modarressi, Hinrich Schuetze, Nanyun Peng
cs.AI

摘要

Dense retrieval models are commonly used in Information Retrieval (IR) applications, such as Retrieval-Augmented Generation (RAG). Since they often serve as the first step in these systems, their robustness is critical to avoid failures. In this work, by repurposing a relation extraction dataset (e.g., Re-DocRED), we design controlled experiments to quantify the impact of heuristic biases, such as favoring shorter documents, in retrievers like Dragon+ and Contriever. Our findings reveal significant vulnerabilities: retrievers often rely on superficial patterns such as over-prioritizing document beginnings, shorter documents, repeated entities, and literal matches. Additionally, they tend to overlook whether the document contains the query's answer, lacking deep semantic understanding. Notably, when multiple biases combine, models exhibit catastrophic performance degradation, selecting the answer-containing document over a biased answer-free document in fewer than 3% of cases. Furthermore, we show that these biases have direct consequences for downstream applications like RAG, where retrieval-preferred documents can mislead LLMs, resulting in a 34% performance drop compared to providing no documents at all.
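To illustrate the literal-match bias the abstract describes, the sketch below scores documents against a query by the dot product of their embeddings, as a bi-encoder retriever does at ranking time. The toy bag-of-words "encoder" is a hypothetical stand-in for real dense encoders such as Dragon+ or Contriever (which are not used here); it shows how a document that merely repeats the query's words can outrank one that actually contains the answer.

```python
import numpy as np

def embed(text, vocab):
    """Toy bag-of-words 'encoder': term counts over a fixed vocabulary,
    L2-normalized. A hypothetical stand-in for a real dense encoder."""
    tokens = text.lower().split()
    vec = np.array([tokens.count(w) for w in vocab], dtype=float)
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

query = "where was Marie Curie born"
# One document answers the query; the other just echoes its wording.
doc_answer = "Marie Curie was born in Warsaw and later moved to Paris"
doc_literal = "where was Marie Curie born where was Marie Curie born"

vocab = sorted(set((query + " " + doc_answer + " " + doc_literal).lower().split()))
q = embed(query, vocab)
scores = {name: float(np.dot(q, embed(doc, vocab)))
          for name, doc in [("answer", doc_answer), ("literal", doc_literal)]}

# The literal-match document scores higher despite lacking the answer.
print(scores)
```

With this toy encoder the query-echoing document reaches the maximum similarity of 1.0, while the answer-bearing document scores lower because its extra content dilutes the overlap. Real dense encoders are far more expressive, but the paper's experiments show they exhibit the same failure mode under controlled conditions.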

