Evaluating RAG-Fusion with RAGElo: an Automated Elo-based Framework
June 20, 2024
Authors: Zackary Rackauckas, Arthur Câmara, Jakub Zavrel
cs.AI
Abstract
Challenges in the automated evaluation of Retrieval-Augmented Generation
(RAG) Question-Answering (QA) systems include hallucination problems in
domain-specific knowledge and the lack of gold standard benchmarks for company
internal tasks. This results in difficulties in evaluating RAG variations, like
RAG-Fusion (RAGF), in the context of a product QA task at Infineon
Technologies. To solve these problems, we propose a comprehensive evaluation
framework, which leverages Large Language Models (LLMs) to generate large
datasets of synthetic queries based on real user queries and in-domain
documents, uses LLM-as-a-judge to rate retrieved documents and answers,
evaluates the quality of answers, and ranks different variants of
Retrieval-Augmented Generation (RAG) agents with RAGElo's automated Elo-based
competition. LLM-as-a-judge rating of a random sample of synthetic queries
shows a moderate, positive correlation with domain expert scoring in relevance,
accuracy, completeness, and precision. While RAGF outperformed RAG in Elo
score, a significance analysis against expert annotations also shows that RAGF
significantly outperforms RAG in completeness, but underperforms in precision.
In addition, Infineon's RAGF assistant demonstrated slightly higher performance
in document relevance based on MRR@5 scores. We find that RAGElo positively
aligns with the preferences of human annotators, though due caution is still
required. Finally, RAGF's approach leads to more complete answers based on
expert annotations and better answers overall based on RAGElo's evaluation
criteria.
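As a concrete illustration of the Elo-based competition the abstract describes, the sketch below shows how pairwise, LLM-judged comparisons between two RAG variants could drive standard Elo rating updates. This is a minimal sketch under stated assumptions: the names `run_tournament` and `judge_preference` and the K-factor of 32 are illustrative, not RAGElo's actual API.

```python
# Illustrative Elo-style tournament between two RAG variants.
# The update rule is the standard Elo formula; everything else
# (function names, K-factor, starting ratings) is an assumption.
import random

K = 32  # Elo K-factor: how much a single game can shift a rating


def expected_score(r_a: float, r_b: float) -> float:
    """Expected win probability of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update(r_a: float, r_b: float, score_a: float) -> tuple[float, float]:
    """Update both ratings after one game; score_a is 1 win, 0.5 tie, 0 loss."""
    e_a = expected_score(r_a, r_b)
    r_a_new = r_a + K * (score_a - e_a)
    r_b_new = r_b + K * ((1.0 - score_a) - (1.0 - e_a))
    return r_a_new, r_b_new


def run_tournament(queries, judge_preference, rounds: int = 100):
    """Rank two agents ("rag", "ragf") by repeated LLM-judged pairwise games."""
    ratings = {"rag": 1000.0, "ragf": 1000.0}
    for _ in range(rounds):
        q = random.choice(queries)
        # judge_preference(q) stands in for an LLM judge comparing both
        # agents' answers to q: 1.0 if RAG wins, 0.0 if RAG-Fusion wins,
        # 0.5 for a tie.
        s = judge_preference(q)
        ratings["rag"], ratings["ragf"] = update(ratings["rag"], ratings["ragf"], s)
    return ratings
```

Repeated pairwise games let a relative ranking emerge without any gold-standard answer key, which is what makes the tournament setup attractive for company-internal tasks lacking benchmarks.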
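The abstract also reports document relevance via MRR@5, a standard IR metric: for each query, take the reciprocal rank of the first relevant document among the top five retrieved (zero if none is relevant), then average over queries. A minimal sketch:

```python
def mrr_at_5(ranked_relevance: list[list[bool]]) -> float:
    """Mean Reciprocal Rank at cutoff 5.

    ranked_relevance[i][j] is True if the j-th retrieved document
    for query i is relevant. Each query contributes 1/rank of its
    first relevant document within the top 5, else 0.
    """
    total = 0.0
    for rels in ranked_relevance:
        for rank, rel in enumerate(rels[:5], start=1):
            if rel:
                total += 1.0 / rank
                break
    return total / len(ranked_relevance)


# Example: first query hits at rank 2, second at rank 1, third misses the top 5.
print(mrr_at_5([[False, True, False], [True], [False] * 5]))  # (0.5 + 1.0 + 0.0) / 3 = 0.5
```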