CRAG -- Comprehensive RAG Benchmark

June 7, 2024
Authors: Xiao Yang, Kai Sun, Hao Xin, Yushi Sun, Nikita Bhalla, Xiangsen Chen, Sajal Choudhary, Rongze Daniel Gui, Ziran Will Jiang, Ziyu Jiang, Lingkun Kong, Brian Moran, Jiaqi Wang, Yifan Ethan Xu, An Yan, Chenyu Yang, Eting Yuan, Hanwen Zha, Nan Tang, Lei Chen, Nicolas Scheffer, Yue Liu, Nirav Shah, Rakesh Wanga, Anuj Kumar, Wen-tau Yih, Xin Luna Dong
cs.AI

Abstract

Retrieval-Augmented Generation (RAG) has recently emerged as a promising solution to alleviate Large Language Models' (LLMs') lack of knowledge. Existing RAG datasets, however, do not adequately represent the diverse and dynamic nature of real-world Question Answering (QA) tasks. To bridge this gap, we introduce the Comprehensive RAG Benchmark (CRAG), a factual question answering benchmark of 4,409 question-answer pairs and mock APIs to simulate web and Knowledge Graph (KG) search. CRAG is designed to encapsulate a diverse array of questions across five domains and eight question categories, reflecting varied entity popularity from popular to long-tail, and temporal dynamism ranging from years to seconds. Our evaluation on this benchmark highlights the gap to fully trustworthy QA. Whereas most advanced LLMs achieve <=34% accuracy on CRAG, adding RAG in a straightforward manner improves the accuracy only to 44%. State-of-the-art industry RAG solutions answer only 63% of questions without any hallucination. CRAG also reveals much lower accuracy in answering questions regarding facts with higher dynamism, lower popularity, or higher complexity, suggesting future research directions. The CRAG benchmark laid the groundwork for a KDD Cup 2024 challenge, attracting thousands of participants and submissions within the first 50 days of the competition. We commit to maintaining CRAG to serve research communities in advancing RAG solutions and general QA solutions.
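
The headline numbers above (accuracy, and the share of questions answered without hallucination) suggest a three-way grading of answers: accurate, missing (the system abstains), or hallucinated. Below is a minimal sketch of that style of scoring, assuming hypothetical names (QAPair, score_response, evaluate) and exact-match grading as placeholders; the benchmark's actual harness grades answers with model- and human-based evaluation rather than string comparison.

```python
# A minimal, illustrative sketch of CRAG-style scoring: each answer is
# graded as accurate (+1), missing/abstained (0), or hallucinated (-1).
# Exact-match grading here is a stand-in for the paper's model- and
# human-based grading.

from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class QAPair:
    question: str
    answer: str  # ground-truth answer


def score_response(response: str, truth: str) -> int:
    """Return +1 (accurate), 0 (missing), or -1 (hallucinated)."""
    normalized = response.strip().lower()
    if normalized in {"i don't know", "i dont know"}:
        return 0  # abstaining is penalized less than hallucinating
    return 1 if normalized == truth.strip().lower() else -1


def evaluate(pairs: List[QAPair], answer_fn: Callable[[str], str]) -> Dict[str, float]:
    """Run a QA system over the pairs and aggregate the three rates."""
    scores = [score_response(answer_fn(p.question), p.answer) for p in pairs]
    n = len(scores)
    return {
        "accuracy": scores.count(1) / n,
        "hallucination": scores.count(-1) / n,
        "missing": scores.count(0) / n,
        "score": sum(scores) / n,  # accuracy rate minus hallucination rate
    }


if __name__ == "__main__":
    pairs = [QAPair("Who wrote 'The Old Man and the Sea'?", "Ernest Hemingway")]
    # A system that always abstains scores 0: no hallucination, no accuracy.
    print(evaluate(pairs, lambda q: "I don't know"))
```

Under this scheme, a system that hallucinates is worse off than one that abstains, which is why the 63% no-hallucination figure is reported separately from raw accuracy.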
