Redefining Retrieval Evaluation in the Era of LLMs
October 24, 2025
Authors: Giovanni Trappolini, Florin Cuconasu, Simone Filice, Yoelle Maarek, Fabrizio Silvestri
cs.AI
Abstract
Traditional Information Retrieval (IR) metrics, such as nDCG, MAP, and MRR,
assume that human users sequentially examine documents with diminishing
attention to lower ranks. This assumption breaks down in Retrieval Augmented
Generation (RAG) systems, where search results are consumed by Large Language
Models (LLMs), which, unlike humans, process all retrieved documents as a whole
rather than sequentially. Additionally, traditional IR metrics do not account
for related but irrelevant documents that actively degrade generation quality,
rather than merely being ignored. Due to these two major misalignments, namely
human vs. machine position discount and human relevance vs. machine utility,
classical IR metrics do not accurately predict RAG performance. We introduce a
utility-based annotation schema that quantifies both the positive contribution
of relevant passages and the negative impact of distracting ones. Building on
this foundation, we propose UDCG (Utility and Distraction-aware Cumulative
Gain), a metric that uses an LLM-oriented positional discount to directly optimize
its correlation with end-to-end answer accuracy. Experiments on five
datasets and six LLMs demonstrate that UDCG improves correlation by up to 36%
compared to traditional metrics. Our work provides a critical step toward
aligning IR evaluation with LLM consumers and enables more reliable assessment
of RAG components.
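
To make the contrast concrete, below is a minimal Python sketch comparing a classic nDCG computation with a UDCG-like score. The abstract does not give the exact UDCG formula, so the signed-utility sum, the near-uniform (flat) positional discount, the utility scale in [-1, 1], and the function names `ndcg` and `udcg_like` are all illustrative assumptions, not the paper's definition.

```python
# Sketch only: nDCG assumes a steep, human-style log discount over ranks,
# while the UDCG-like score below uses signed utilities (positive for useful
# passages, negative for distracting ones) with a flat, LLM-oriented discount.
# The UDCG formula here is an assumed form for illustration.
import math
from typing import Optional, Sequence


def ndcg(relevance: Sequence[float], k: Optional[int] = None) -> float:
    """Classic nDCG: log2-discounted gain, modeling sequential human reading."""
    k = k or len(relevance)
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevance[:k]))
    ideal = sorted(relevance, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0


def udcg_like(utilities: Sequence[float],
              discount: Optional[Sequence[float]] = None) -> float:
    """UDCG-like score (assumed form): sum of signed utility annotations in
    [-1, 1], weighted by a positional discount that reflects how the LLM
    consumes the context (near-uniform by default, unlike the human log
    discount)."""
    if discount is None:
        discount = [1.0] * len(utilities)  # flat weighting as a placeholder
    return sum(u * d for u, d in zip(utilities, discount))


if __name__ == "__main__":
    # Hypothetical ranked list: binary relevance hides the harm of a
    # distracting passage, while signed utilities expose it.
    relevance = [1, 0, 1, 0, 0]                # graded relevance for nDCG
    utilities = [0.9, -0.6, 0.7, 0.0, -0.2]    # signed utilities (assumed scale)
    print(f"nDCG      = {ndcg(relevance):.3f}")
    print(f"UDCG-like = {udcg_like(utilities):.3f}")
```

The design point the sketch tries to surface is the one the abstract argues: a distracting passage contributes zero gain under relevance-only metrics but subtracts from the utility-based score, and position matters far less once the consumer is an LLM rather than a sequentially reading human.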