Evaluating RAG-Fusion with RAGElo: an Automated Elo-based Framework

June 20, 2024
Authors: Zackary Rackauckas, Arthur Câmara, Jakub Zavrel
cs.AI

Abstract

Challenges in the automated evaluation of Retrieval-Augmented Generation (RAG) Question-Answering (QA) systems include hallucination problems in domain-specific knowledge and the lack of gold standard benchmarks for company internal tasks. This results in difficulties in evaluating RAG variations, like RAG-Fusion (RAGF), in the context of a product QA task at Infineon Technologies. To solve these problems, we propose a comprehensive evaluation framework, which leverages Large Language Models (LLMs) to generate large datasets of synthetic queries based on real user queries and in-domain documents, uses LLM-as-a-judge to rate retrieved documents and answers, evaluates the quality of answers, and ranks different variants of Retrieval-Augmented Generation (RAG) agents with RAGElo's automated Elo-based competition. LLM-as-a-judge rating of a random sample of synthetic queries shows a moderate, positive correlation with domain expert scoring in relevance, accuracy, completeness, and precision. While RAGF outperformed RAG in Elo score, a significance analysis against expert annotations also shows that RAGF significantly outperforms RAG in completeness, but underperforms in precision. In addition, Infineon's RAGF assistant demonstrated slightly higher performance in document relevance based on MRR@5 scores. We find that RAGElo positively aligns with the preferences of human annotators, though due caution is still required. Finally, RAGF's approach leads to more complete answers based on expert annotations and better answers overall based on RAGElo's evaluation criteria.
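
The framework described above combines LLM-as-a-judge pairwise comparisons, Elo-based ranking via RAGElo, and MRR@5 for retrieval quality. The sketch below is a minimal illustration of those two scoring mechanisms, not RAGElo's actual implementation or API: the `judge` callable, the agent names, the K-factor, and the initial rating are illustrative assumptions.

```python
# Minimal sketch of an Elo-style tournament for ranking RAG variants,
# in the spirit of RAGElo, plus a standard MRR@5 computation.
# NOT RAGElo's actual API; judge, agents, K, and ratings are assumed.
from typing import Callable, Dict, List

K = 32  # Elo K-factor (assumed); controls rating update magnitude


def expected_score(r_a: float, r_b: float) -> float:
    """Expected win probability of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def run_elo_tournament(
    agents: List[str],
    queries: List[str],
    judge: Callable[[str, str, str], float],
    initial_rating: float = 1000.0,
) -> Dict[str, float]:
    """Rate agents by pairwise comparisons on each query.

    `judge(query, agent_a, agent_b)` returns 1.0 if A's answer is judged
    better, 0.0 if B's is better, and 0.5 for a tie (e.g. an LLM judge
    comparing the two generated answers for the same query).
    """
    ratings = {agent: initial_rating for agent in agents}
    for query in queries:
        for i, a in enumerate(agents):
            for b in agents[i + 1:]:
                outcome = judge(query, a, b)  # score from A's perspective
                e_a = expected_score(ratings[a], ratings[b])
                ratings[a] += K * (outcome - e_a)
                ratings[b] += K * ((1.0 - outcome) - (1.0 - e_a))
    return ratings


def mrr_at_k(ranked_relevance: List[List[bool]], k: int = 5) -> float:
    """Mean Reciprocal Rank at cutoff k.

    Each inner list holds relevance flags for one query's retrieved
    documents in rank order; the per-query score is the reciprocal rank
    of the first relevant document within the top k (0 if none).
    """
    total = 0.0
    for flags in ranked_relevance:
        for rank, relevant in enumerate(flags[:k], start=1):
            if relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_relevance) if ranked_relevance else 0.0
```

With a judge backed by an LLM prompt that compares two candidate answers to the same query, `run_elo_tournament(["RAG", "RAGF"], queries, judge)` yields ratings whose ordering can then be checked against expert annotations, mirroring the comparison reported in the abstract.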
