Evaluating RAG-Fusion with RAGElo: an Automated Elo-based Framework

Abstract

Challenges in the automated evaluation of Retrieval-Augmented Generation (RAG) Question-Answering (QA) systems include hallucination problems in domain-specific knowledge and the lack of gold-standard benchmarks for company-internal tasks. This makes it difficult to evaluate RAG variations, such as RAG-Fusion (RAGF), in the context of a product QA task at Infineon Technologies. To address these problems, we propose a comprehensive evaluation framework that leverages Large Language Models (LLMs) to generate large datasets of synthetic queries based on real user queries and in-domain documents, uses LLM-as-a-judge to rate retrieved documents and answers, evaluates the quality of answers, and ranks different variants of RAG agents with RAGElo's automated Elo-based competition. LLM-as-a-judge ratings of a random sample of synthetic queries show a moderate, positive correlation with domain expert scoring in relevance, accuracy, completeness, and precision. While RAGF outperformed RAG in Elo score, a significance analysis against expert annotations also shows that RAGF significantly outperforms RAG in completeness but underperforms in precision. In addition, Infineon's RAGF assistant demonstrated slightly higher performance in document relevance based on MRR@5 scores. We find that RAGElo positively aligns with the preferences of human annotators, though due caution is still required. Finally, RAGF's approach leads to more complete answers based on expert annotations and better answers overall based on RAGElo's evaluation criteria.
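For readers unfamiliar with the two metrics named in the abstract, the following is a minimal Python sketch of MRR@5 and of a standard pairwise Elo update. It is an illustration only and not RAGElo's actual implementation: the function names `mrr_at_k` and `elo_update`, the K-factor of 32, and the starting rating of 1000 are assumptions chosen for the example, and the abstract does not specify how the framework sets them.

```python
from typing import Sequence, Tuple


def mrr_at_k(relevance_lists: Sequence[Sequence[bool]], k: int = 5) -> float:
    """Mean reciprocal rank of the first relevant document within the top k.

    Each inner sequence holds relevance flags for one query's ranked results.
    A query with no relevant document in the top k contributes 0.
    """
    if not relevance_lists:
        return 0.0
    total = 0.0
    for flags in relevance_lists:
        for rank, is_relevant in enumerate(flags[:k], start=1):
            if is_relevant:
                total += 1.0 / rank
                break
    return total / len(relevance_lists)


def elo_update(
    rating_a: float,
    rating_b: float,
    score_a: float,
    k_factor: float = 32.0,  # assumed value, not taken from the paper
) -> Tuple[float, float]:
    """Standard Elo update after one pairwise comparison.

    score_a is 1.0 if agent A's answer was preferred, 0.0 if agent B's was,
    and 0.5 for a tie.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    delta = k_factor * (score_a - expected_a)
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # Three hypothetical queries: first hit at rank 2, at rank 1, and none in the top 5.
    print(mrr_at_k([[False, True], [True], [False] * 6]))
    # Two agents starting at an assumed rating of 1000; the judge prefers agent A.
    print(elo_update(1000.0, 1000.0, 1.0))
```

In a tournament of this kind, each pairwise judgment made by the LLM judge would feed one such update, and the final ratings give the ranking of the agents.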
