The Distracting Effect: Understanding Irrelevant Passages in RAG
May 11, 2025
Authors: Chen Amiraz, Florin Cuconasu, Simone Filice, Zohar Karnin
cs.AI
Abstract
A well-known issue with Retrieval Augmented Generation (RAG) is that
retrieved passages that are irrelevant to the query sometimes distract the
answer-generating LLM, causing it to provide an incorrect response. In this
paper, we shed light on this core issue and formulate the distracting effect of
a passage w.r.t. a query (and an LLM). We provide a quantifiable measure of the
distracting effect of a passage and demonstrate its robustness across LLMs.
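
The abstract does not spell out how the distracting effect is computed. As one plausible illustration only (not the paper's definition), the sketch below scores a passage by how much prepending it to the prompt lowers the model's probability of producing the gold answer; `score_answer_logprob` is a hypothetical wrapper around whatever LLM API returns the log-probability assigned to a given answer string.

```python
# Illustrative sketch only: score a candidate passage's distracting effect as
# the drop in the model's likelihood of the gold answer once the passage is
# added to the prompt. `score_answer_logprob` is a hypothetical helper.

import math
from typing import Callable

def distracting_effect(
    query: str,
    passage: str,
    gold_answer: str,
    score_answer_logprob: Callable[[str, str], float],
) -> float:
    """Return a value in [-1, 1]; larger means the passage is more distracting."""
    # Prompt without any retrieved context (closed-book baseline).
    baseline_prompt = f"Question: {query}\nAnswer:"
    # Prompt with the candidate (irrelevant) passage prepended.
    distracted_prompt = f"Passage: {passage}\n\nQuestion: {query}\nAnswer:"

    p_baseline = math.exp(score_answer_logprob(baseline_prompt, gold_answer))
    p_distracted = math.exp(score_answer_logprob(distracted_prompt, gold_answer))

    # Positive when adding the passage lowers the chance of the gold answer.
    return p_baseline - p_distracted
```
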
Our research introduces novel methods for identifying and using hard
distracting passages to improve RAG systems. By fine-tuning LLMs with these
carefully selected distracting passages, we achieve up to a 7.5% increase in
answering accuracy compared to counterparts fine-tuned on conventional RAG
datasets. Our contribution is two-fold: first, we move beyond the simple binary
classification of irrelevant passages as either completely unrelated or
distracting, and second, we develop and analyze multiple methods for finding
hard distracting passages. To our knowledge, no other research has provided
such a comprehensive framework for identifying and utilizing hard distracting
passages.
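
Purely as an illustration of how hard distractors might feed into fine-tuning (the abstract does not describe the actual data-construction pipeline, and all names below are assumptions): the sketch ranks candidate irrelevant passages by a distraction score, such as the one above, and packs the hardest ones into the context of a single prompt/target training example.

```python
# Illustrative sketch, not the paper's released code: build a fine-tuning
# example that pairs the query and gold passage with the top-k hardest
# distractors, ranked by a user-supplied distraction score.

from typing import Callable, Dict, List

def build_finetuning_example(
    query: str,
    gold_passage: str,
    gold_answer: str,
    irrelevant_passages: List[str],
    score_distraction: Callable[[str, str], float],
    k: int = 2,
) -> Dict[str, str]:
    """Return one prompt/target pair whose context contains hard distractors."""
    # Rank candidate irrelevant passages from most to least distracting.
    ranked = sorted(
        irrelevant_passages,
        key=lambda p: score_distraction(query, p),
        reverse=True,
    )
    hard_distractors = ranked[:k]

    # Interleave the gold passage among the hard distractors so the model must
    # learn to ignore distracting context rather than rely on position.
    context = "\n\n".join(hard_distractors[:1] + [gold_passage] + hard_distractors[1:])
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return {"prompt": prompt, "target": gold_answer}
```
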