The Distracting Effect: Understanding Irrelevant Passages in RAG
May 11, 2025
Authors: Chen Amiraz, Florin Cuconasu, Simone Filice, Zohar Karnin
cs.AI
Abstract
A well-known issue with Retrieval Augmented Generation (RAG) is that retrieved passages that are irrelevant to the query sometimes distract the answer-generating LLM, causing it to provide an incorrect response. In this paper, we shed light on this core issue and formulate the distracting effect of a passage w.r.t. a query (and an LLM). We provide a quantifiable measure of the distracting effect of a passage and demonstrate its robustness across LLMs.
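The abstract does not spell out the exact formulation, but the quantifiable measure it refers to can be pictured roughly as a comparison of the model's answer accuracy with and without the irrelevant passage in the prompt. The sketch below illustrates that reading in Python; the `generate` callable, the exact-match scoring, and the sampling scheme are illustrative assumptions, not the paper's definition.

```python
from typing import Callable, Optional, Sequence

def distracting_effect(
    generate: Callable[[str, Optional[Sequence[str]]], str],  # (query, passages) -> answer; assumed interface
    query: str,
    gold_answer: str,
    passage: str,
    n_samples: int = 8,
) -> float:
    """Rough estimate of how much `passage` distracts the model on `query`.

    Approximated here as the drop in exact-match accuracy when the irrelevant
    passage is added to the prompt (an assumption; the paper may use a
    different scoring rule or prompt setup).
    """
    def accuracy(passages: Optional[Sequence[str]]) -> float:
        # Sample a few generations and count how often the gold answer appears.
        hits = sum(
            gold_answer.strip().lower() in generate(query, passages).strip().lower()
            for _ in range(n_samples)
        )
        return hits / n_samples

    # Larger values mean the passage is more distracting for this query and LLM.
    return max(0.0, accuracy(None) - accuracy([passage]))
```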
Our research introduces novel methods for identifying and using hard distracting passages to improve RAG systems. By fine-tuning LLMs with these carefully selected distracting passages, we achieve up to a 7.5% increase in answering accuracy compared to counterparts fine-tuned on conventional RAG datasets. Our contribution is two-fold: first, we move beyond the simple binary classification of irrelevant passages as either completely unrelated or distracting, and second, we develop and analyze multiple methods for finding hard distracting passages. To our knowledge, no other research has provided such a comprehensive framework for identifying and utilizing hard distracting passages.
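For the fine-tuning side, one plausible reading of "fine-tuning LLMs with these carefully selected distracting passages" is to mine, per training query, the retrieved irrelevant passages with the highest distracting effect and mix them into each training example's context. The sketch below is an assumption-laden illustration of that idea; `retrieve` is a hypothetical retriever, and `distracting_effect` is the measure sketched above, not the paper's published pipeline.

```python
from typing import Callable, Dict, List, Optional, Sequence

def build_training_example(
    generate: Callable[[str, Optional[Sequence[str]]], str],  # same assumed LLM interface as above
    retrieve: Callable[[str], List[str]],                     # hypothetical retriever: query -> candidate passages
    query: str,
    gold_answer: str,
    gold_passage: str,
    k: int = 2,
) -> Dict[str, object]:
    """Assemble one fine-tuning example whose context pairs the gold passage
    with the k hardest (most distracting) irrelevant passages."""
    # Treat every retrieved passage other than the gold one as irrelevant
    # (a simplification; the paper may filter relevance more carefully).
    candidates = [p for p in retrieve(query) if p != gold_passage]

    # Rank candidates by the distracting-effect measure sketched earlier and keep the top k.
    hard_distractors = sorted(
        candidates,
        key=lambda p: distracting_effect(generate, query, gold_answer, p),
        reverse=True,
    )[:k]

    return {
        "query": query,
        "context": [gold_passage, *hard_distractors],
        "target": gold_answer,
    }
```

Training examples built this way confront the model with distractors it is known to be susceptible to, which is one way to interpret the reported accuracy gains over fine-tuning on conventional RAG datasets.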