

GroUSE: A Benchmark to Evaluate Evaluators in Grounded Question Answering

September 10, 2024
Authors: Sacha Muller, António Loison, Bilel Omrani, Gautier Viaud
cs.AI

Abstract

Retrieval-Augmented Generation (RAG) has emerged as a common paradigm to use Large Language Models (LLMs) alongside private and up-to-date knowledge bases. In this work, we address the challenges of using LLM-as-a-Judge when evaluating grounded answers generated by RAG systems. To assess the calibration and discrimination capabilities of judge models, we identify 7 generator failure modes and introduce GroUSE (Grounded QA Unitary Scoring of Evaluators), a meta-evaluation benchmark of 144 unit tests. This benchmark reveals that existing automated RAG evaluation frameworks often overlook important failure modes, even when using GPT-4 as a judge. To improve on the current design of automated RAG evaluation frameworks, we propose a novel pipeline and find that while closed models perform well on GroUSE, state-of-the-art open-source judges do not generalize to our proposed criteria, despite strong correlation with GPT-4's judgement. Our findings suggest that correlation with GPT-4 is an incomplete proxy for the practical performance of judge models and should be supplemented with evaluations on unit tests for precise failure mode detection. We further show that finetuning Llama-3 on GPT-4's reasoning traces significantly boosts its evaluation capabilities, improving upon both correlation with GPT-4's evaluations and calibration on reference situations.
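The core idea of GroUSE's unit tests — pairing a grounded-QA sample exhibiting a known failure mode with the score a well-calibrated judge should assign — can be sketched as follows. This is a minimal illustration, not the paper's actual harness: `judge_answer` is a hypothetical stand-in (a real judge would be an LLM call), and the scoring scale and test samples are invented for the example.

```python
def judge_answer(question: str, references: list[str], answer: str) -> int:
    """Hypothetical judge: returns a 1-5 score for a grounded answer.

    This stub only flags one obvious failure mode -- an answer whose
    content is not supported by any reference passage -- with the
    worst score. An actual judge model would evaluate several criteria.
    """
    grounded = any(ref.lower() in answer.lower() for ref in references)
    return 5 if grounded else 1

def run_unit_test(sample: dict) -> bool:
    """A GroUSE-style unit test: pass iff the judge's score
    matches the expected score for this reference situation."""
    score = judge_answer(sample["question"], sample["references"], sample["answer"])
    return score == sample["expected_score"]

# Two toy reference situations: one faithful answer,
# one ungrounded (fabricated) answer.
tests = [
    {"question": "Who wrote Hamlet?",
     "references": ["Hamlet was written by William Shakespeare."],
     "answer": "Hamlet was written by William Shakespeare.",
     "expected_score": 5},
    {"question": "Who wrote Hamlet?",
     "references": ["Hamlet was written by William Shakespeare."],
     "answer": "Hamlet was written by Christopher Marlowe.",
     "expected_score": 1},
]

results = [run_unit_test(t) for t in tests]
print(results)  # a well-calibrated judge passes both unit tests
```

A meta-evaluation benchmark then reports, for each judge model, the fraction of such unit tests passed — which is the calibration signal the paper argues correlation-with-GPT-4 alone fails to capture.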
