Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation
September 19, 2024
Authors: Satyapriya Krishna, Kalpesh Krishna, Anhad Mohananey, Steven Schwarcz, Adam Stambler, Shyam Upadhyay, Manaal Faruqui
cs.AI
Abstract
Large Language Models (LLMs) have demonstrated significant performance
improvements across various cognitive tasks. An emerging application is using
LLMs to enhance retrieval-augmented generation (RAG) capabilities. These
systems require LLMs to understand user queries, retrieve relevant information,
and synthesize coherent and accurate responses. Given the increasing real-world
deployment of such systems, comprehensive evaluation becomes crucial. To this
end, we propose FRAMES (Factuality, Retrieval, And reasoning MEasurement Set),
a high-quality evaluation dataset designed to test LLMs' ability to provide
factual responses, assess retrieval capabilities, and evaluate the reasoning
required to generate final answers. While previous work has provided datasets
and benchmarks to evaluate these abilities in isolation, FRAMES offers a
unified framework that provides a clearer picture of LLM performance in
end-to-end RAG scenarios. Our dataset comprises challenging multi-hop questions
that require the integration of information from multiple sources. We present
baseline results demonstrating that even state-of-the-art LLMs struggle with
this task, achieving only 0.40 accuracy with no retrieval. Accuracy improves
significantly with our proposed multi-step retrieval pipeline, reaching 0.66
(>50% relative improvement). We hope our work will help
bridge evaluation gaps and assist in developing more robust and capable RAG
systems.