
Redefining Retrieval Evaluation in the Era of LLMs

October 24, 2025
作者: Giovanni Trappolini, Florin Cuconasu, Simone Filice, Yoelle Maarek, Fabrizio Silvestri
cs.AI

Abstract

Traditional Information Retrieval (IR) metrics, such as nDCG, MAP, and MRR, assume that human users sequentially examine documents with diminishing attention to lower ranks. This assumption breaks down in Retrieval Augmented Generation (RAG) systems, where search results are consumed by Large Language Models (LLMs), which, unlike humans, process all retrieved documents as a whole rather than sequentially. Additionally, traditional IR metrics do not account for related but irrelevant documents that actively degrade generation quality, rather than merely being ignored. Due to these two major misalignments, namely human vs. machine position discount and human relevance vs. machine utility, classical IR metrics do not accurately predict RAG performance. We introduce a utility-based annotation schema that quantifies both the positive contribution of relevant passages and the negative impact of distracting ones. Building on this foundation, we propose UDCG (Utility and Distraction-aware Cumulative Gain), a metric that uses an LLM-oriented positional discount to directly optimize correlation with end-to-end answer accuracy. Experiments on five datasets and six LLMs demonstrate that UDCG improves correlation by up to 36% compared to traditional metrics. Our work provides a critical step toward aligning IR evaluation with LLM consumers and enables more reliable assessment of RAG components.
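To make the core idea concrete, the sketch below contrasts a classical discounted cumulative gain with a UDCG-style score. The exact annotation scale and discount function are not given in this abstract, so the signed utility values and the uniform default weighting here are illustrative assumptions, not the paper's actual formulation:

```python
import math

def dcg(relevances):
    """Classical DCG: non-negative relevance grades with a
    log-based positional discount, modeling a human reader
    whose attention decays down the ranked list."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances))

def udcg_sketch(utilities, discount=None):
    """Illustrative UDCG-style score: per-passage utilities are
    signed (positive for helpful passages, negative for
    distracting ones), and the positional discount is a free
    parameter. A uniform discount (the default here) reflects
    the abstract's point that an LLM consumes all retrieved
    documents as a whole rather than top-down."""
    if discount is None:
        discount = [1.0] * len(utilities)  # assumption: no positional decay
    return sum(u * d for u, d in zip(utilities, discount))

# A ranking with one helpful passage, one distractor, one mildly
# helpful passage. DCG ignores the distractor's harm; the signed
# utilities let it subtract from the score.
scores = udcg_sketch([1.0, -0.5, 0.5])
```

The key structural differences are the signed utilities (a distractor lowers the score instead of contributing zero) and the decoupled discount, which the paper tunes toward the LLM consumer rather than fixing to a human-attention model.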
December 17, 2025