

GroUSE: A Benchmark to Evaluate Evaluators in Grounded Question Answering

September 10, 2024
Authors: Sacha Muller, António Loison, Bilel Omrani, Gautier Viaud
cs.AI

Abstract

Retrieval-Augmented Generation (RAG) has emerged as a common paradigm to use Large Language Models (LLMs) alongside private and up-to-date knowledge bases. In this work, we address the challenges of using LLM-as-a-Judge when evaluating grounded answers generated by RAG systems. To assess the calibration and discrimination capabilities of judge models, we identify 7 generator failure modes and introduce GroUSE (Grounded QA Unitary Scoring of Evaluators), a meta-evaluation benchmark of 144 unit tests. This benchmark reveals that existing automated RAG evaluation frameworks often overlook important failure modes, even when using GPT-4 as a judge. To improve on the current design of automated RAG evaluation frameworks, we propose a novel pipeline and find that while closed models perform well on GroUSE, state-of-the-art open-source judges do not generalize to our proposed criteria, despite strong correlation with GPT-4's judgement. Our findings suggest that correlation with GPT-4 is an incomplete proxy for the practical performance of judge models and should be supplemented with evaluations on unit tests for precise failure mode detection. We further show that finetuning Llama-3 on GPT-4's reasoning traces significantly boosts its evaluation capabilities, improving upon both correlation with GPT-4's evaluations and calibration on reference situations.
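The core idea of GroUSE's unit tests — pairing a grounded-QA sample exhibiting a known failure mode with the score a well-calibrated judge should assign — can be sketched as follows. This is a minimal illustration, not the paper's actual harness: `judge_answer` is a hypothetical stand-in (a real judge would be an LLM call), and the scoring scale and test samples are invented for the example.

```python
def judge_answer(question: str, references: list[str], answer: str) -> int:
    """Hypothetical judge: returns a 1-5 score for a grounded answer.

    This stub only flags one obvious failure mode -- an answer whose
    content is not supported by any reference passage -- with the
    worst score. An actual judge model would evaluate several criteria.
    """
    grounded = any(ref.lower() in answer.lower() for ref in references)
    return 5 if grounded else 1

def run_unit_test(sample: dict) -> bool:
    """A GroUSE-style unit test: pass iff the judge's score
    matches the expected score for this reference situation."""
    score = judge_answer(sample["question"], sample["references"], sample["answer"])
    return score == sample["expected_score"]

# Two toy reference situations: one faithful answer,
# one ungrounded (fabricated) answer.
tests = [
    {"question": "Who wrote Hamlet?",
     "references": ["Hamlet was written by William Shakespeare."],
     "answer": "Hamlet was written by William Shakespeare.",
     "expected_score": 5},
    {"question": "Who wrote Hamlet?",
     "references": ["Hamlet was written by William Shakespeare."],
     "answer": "Hamlet was written by Christopher Marlowe.",
     "expected_score": 1},
]

results = [run_unit_test(t) for t in tests]
print(results)  # a well-calibrated judge passes both unit tests
```

A meta-evaluation benchmark then reports, for each judge model, the fraction of such unit tests passed — which is the calibration signal the paper argues correlation-with-GPT-4 alone fails to capture.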
