GroUSE: A Benchmark to Evaluate Evaluators in Grounded Question Answering
September 10, 2024
Authors: Sacha Muller, António Loison, Bilel Omrani, Gautier Viaud
cs.AI
Abstract
Retrieval-Augmented Generation (RAG) has emerged as a common paradigm to use
Large Language Models (LLMs) alongside private and up-to-date knowledge bases.
In this work, we address the challenges of using LLM-as-a-Judge when evaluating
grounded answers generated by RAG systems. To assess the calibration and
discrimination capabilities of judge models, we identify 7 generator failure
modes and introduce GroUSE (Grounded QA Unitary Scoring of Evaluators), a
meta-evaluation benchmark of 144 unit tests. This benchmark reveals that
existing automated RAG evaluation frameworks often overlook important failure
modes, even when using GPT-4 as a judge.
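
To illustrate how such unit tests can exercise a judge model, here is a minimal Python sketch. The JudgeUnitTest fields, the judge call interface, the criterion names, and the 1-5 scoring scale are assumptions for illustration, not GroUSE's actual API.

```python
# Hypothetical sketch of a GroUSE-style unit test for a judge model.
# The judge scores a grounded answer; the test passes only if the score
# lands in the range expected for the injected failure mode.

from dataclasses import dataclass

@dataclass
class JudgeUnitTest:
    question: str
    references: list[str]            # retrieved passages the answer must be grounded in
    answer: str                      # answer exhibiting a known (or no) failure mode
    criterion: str                   # e.g. "faithfulness", "completeness", "usefulness"
    expected_range: tuple[int, int]  # scores a well-calibrated judge should give

def run_unit_test(test: JudgeUnitTest, judge) -> bool:
    """Return True if the judge's score falls in the expected range."""
    score = judge.score(
        question=test.question,
        references=test.references,
        answer=test.answer,
        criterion=test.criterion,
    )
    lo, hi = test.expected_range
    return lo <= score <= hi

# Example: an answer that contradicts the references should receive
# a low faithfulness score from a calibrated judge.
test = JudgeUnitTest(
    question="When was the GroUSE benchmark released?",
    references=["GroUSE was introduced in September 2024."],
    answer="GroUSE was released in 2021.",  # contradicts the reference
    criterion="faithfulness",
    expected_range=(1, 2),
)
```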
To improve on the current design of automated RAG evaluation frameworks, we
propose a novel pipeline and find that while closed models perform well on
GroUSE, state-of-the-art open-source judges do not generalize to our proposed
criteria, despite strong correlation with GPT-4's judgement. Our findings
suggest that correlation with GPT-4 is an incomplete proxy for the practical
performance of judge models and should be supplemented with evaluations on unit
tests for precise failure mode detection.
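
To make the distinction concrete, the following sketch contrasts the two signals discussed above: rank correlation with GPT-4's scores versus pass rate on calibration unit tests. The helper names and the numbers in the comments are illustrative, not results from the paper.

```python
# A judge can correlate strongly with GPT-4 on average yet still fail
# specific calibration tests, so both signals are worth measuring.

from scipy.stats import spearmanr

def correlation_with_gpt4(judge_scores: list[float], gpt4_scores: list[float]) -> float:
    """Spearman rank correlation between a judge's scores and GPT-4's."""
    rho, _ = spearmanr(judge_scores, gpt4_scores)
    return rho

def unit_test_pass_rate(results: list[bool]) -> float:
    """Fraction of GroUSE-style unit tests the judge passes."""
    return sum(results) / len(results)

# Hypothetical outcome illustrating the gap the abstract describes:
#   correlation_with_gpt4(...) -> 0.9  (close agreement in ranking)
#   unit_test_pass_rate(...)   -> 0.6  (40% of failure modes missed)
```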
We further show that finetuning Llama-3 on GPT-4's reasoning traces
significantly boosts its evaluation capabilities, improving upon both
correlation with GPT-4's evaluations and calibration on reference situations.
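
A minimal sketch of how such finetuning data could be assembled from GPT-4 reasoning traces, assuming a simple prompt/completion format where the student (e.g. Llama-3) learns to reproduce both the rationale and the final score; the field names, prompt wording, and scoring scale are hypothetical.

```python
# Hedged sketch: building supervised finetuning examples from GPT-4
# reasoning traces. Targets include the rationale, not just the score,
# so the student learns *why* a failure mode warrants a given score.

import json

def make_sft_example(question, references, answer, gpt4_rationale, gpt4_score):
    """Format one (prompt, completion) pair for instruction finetuning."""
    prompt = (
        "Evaluate the grounded answer below.\n"
        f"Question: {question}\n"
        f"References: {references}\n"
        f"Answer: {answer}\n"
        "Explain your reasoning, then give a score from 1 to 5."
    )
    completion = f"{gpt4_rationale}\nScore: {gpt4_score}"
    return {"prompt": prompt, "completion": completion}

# Illustrative usage: write one training example as a JSONL record.
with open("judge_sft_data.jsonl", "w") as f:
    example = make_sft_example(
        question="What does GroUSE measure?",
        references=["GroUSE is a meta-evaluation benchmark of 144 unit tests."],
        answer="GroUSE measures evaluator calibration and discrimination.",
        gpt4_rationale="The answer is fully supported by the reference...",
        gpt4_score=5,
    )
    f.write(json.dumps(example) + "\n")
```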