
CRAG -- Comprehensive RAG Benchmark

June 7, 2024
Authors: Xiao Yang, Kai Sun, Hao Xin, Yushi Sun, Nikita Bhalla, Xiangsen Chen, Sajal Choudhary, Rongze Daniel Gui, Ziran Will Jiang, Ziyu Jiang, Lingkun Kong, Brian Moran, Jiaqi Wang, Yifan Ethan Xu, An Yan, Chenyu Yang, Eting Yuan, Hanwen Zha, Nan Tang, Lei Chen, Nicolas Scheffer, Yue Liu, Nirav Shah, Rakesh Wanga, Anuj Kumar, Wen-tau Yih, Xin Luna Dong
cs.AI

Abstract

Retrieval-Augmented Generation (RAG) has recently emerged as a promising solution to alleviate the knowledge deficiency of Large Language Models (LLMs). Existing RAG datasets, however, do not adequately represent the diverse and dynamic nature of real-world Question Answering (QA) tasks. To bridge this gap, we introduce the Comprehensive RAG Benchmark (CRAG), a factual question answering benchmark of 4,409 question-answer pairs and mock APIs to simulate web and Knowledge Graph (KG) search. CRAG is designed to encapsulate a diverse array of questions across five domains and eight question categories, reflecting varied entity popularity from popular to long-tail, and temporal dynamism ranging from years to seconds. Our evaluation on this benchmark highlights the gap to fully trustworthy QA. Whereas most advanced LLMs achieve <=34% accuracy on CRAG, adding RAG in a straightforward manner improves the accuracy only to 44%. State-of-the-art industry RAG solutions answer only 63% of questions without any hallucination. CRAG also reveals much lower accuracy in answering questions regarding facts with higher dynamism, lower popularity, or higher complexity, suggesting future research directions. The CRAG benchmark laid the groundwork for a KDD Cup 2024 challenge, attracting thousands of participants and submissions within the first 50 days of the competition. We commit to maintaining CRAG to serve the research community in advancing RAG solutions and general QA solutions.
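
The abstract distinguishes three outcomes per question: an accurate answer, a missing answer (the model abstains with "I don't know"), and a hallucination. The sketch below illustrates that three-way grading with a truthfulness-style score that rewards accuracy and penalizes hallucination. It is a minimal sketch under stated assumptions: exact-match grading stands in for CRAG's actual judged evaluation, and all names here (`Verdict`, `grade`, `MISSING_PHRASES`) are illustrative, not the benchmark's API.

```python
from dataclasses import dataclass

# Hypothetical phrases taken to signal abstention ("missing" answers).
MISSING_PHRASES = {"i don't know", "i do not know", "unsure"}

@dataclass
class Verdict:
    accurate: int = 0
    missing: int = 0
    hallucinated: int = 0

    def score(self) -> float:
        """Truthfulness-style score: +1 per accurate answer, 0 per
        abstention, -1 per hallucination, averaged over all questions."""
        n = self.accurate + self.missing + self.hallucinated
        return (self.accurate - self.hallucinated) / n if n else 0.0

def grade(answer: str, ground_truth: str, tally: Verdict) -> None:
    """Classify one answer. Exact match is a stand-in for CRAG's
    actual grading, which judges semantic correctness."""
    a = answer.strip().lower()
    if a in MISSING_PHRASES:
        tally.missing += 1          # abstaining is better than being wrong
    elif a == ground_truth.strip().lower():
        tally.accurate += 1
    else:
        tally.hallucinated += 1     # any other answer counts as hallucination

# Usage: grade each QA pair in an (illustrative) evaluation set, then report.
tally = Verdict()
for ans, gt in [("Paris", "Paris"), ("I don't know", "1905"), ("42", "41")]:
    grade(ans, gt, tally)
print(tally, "score:", round(tally.score(), 2))
```

Treating abstentions as neutral and hallucinations as negative reflects the abstract's emphasis on answering "without any hallucination"; the benchmark's published metric may weight these outcomes differently.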
