GroUSE: A Benchmark to Evaluate Evaluators in Grounded Question Answering
September 10, 2024
Authors: Sacha Muller, António Loison, Bilel Omrani, Gautier Viaud
cs.AI
Abstract
Retrieval-Augmented Generation (RAG) has emerged as a common paradigm to use
Large Language Models (LLMs) alongside private and up-to-date knowledge bases.
In this work, we address the challenges of using LLM-as-a-Judge when evaluating
grounded answers generated by RAG systems. To assess the calibration and
discrimination capabilities of judge models, we identify 7 generator failure
modes and introduce GroUSE (Grounded QA Unitary Scoring of Evaluators), a
meta-evaluation benchmark of 144 unit tests. This benchmark reveals that
existing automated RAG evaluation frameworks often overlook important failure
modes, even when using GPT-4 as a judge.
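
To illustrate how such unit tests can exercise a judge model, here is a minimal Python sketch. The JudgeUnitTest fields, the judge call interface, the criterion names, and the 1-5 scoring scale are assumptions for illustration, not GroUSE's actual API.

```python
# Hypothetical sketch of a GroUSE-style unit test for a judge model.
# The judge scores a grounded answer; the test passes only if the score
# lands in the range expected for the injected failure mode.

from dataclasses import dataclass

@dataclass
class JudgeUnitTest:
    question: str
    references: list[str]            # retrieved passages the answer must be grounded in
    answer: str                      # answer exhibiting a known (or no) failure mode
    criterion: str                   # e.g. "faithfulness", "completeness", "usefulness"
    expected_range: tuple[int, int]  # scores a well-calibrated judge should give

def run_unit_test(test: JudgeUnitTest, judge) -> bool:
    """Return True if the judge's score falls in the expected range."""
    score = judge.score(
        question=test.question,
        references=test.references,
        answer=test.answer,
        criterion=test.criterion,
    )
    lo, hi = test.expected_range
    return lo <= score <= hi

# Example: an answer that contradicts the references should receive
# a low faithfulness score from a calibrated judge.
test = JudgeUnitTest(
    question="When was the GroUSE benchmark released?",
    references=["GroUSE was introduced in September 2024."],
    answer="GroUSE was released in 2021.",  # contradicts the reference
    criterion="faithfulness",
    expected_range=(1, 2),
)
```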
To improve on the current design of automated RAG evaluation frameworks, we
propose a novel pipeline and find that while closed models perform well on
GroUSE, state-of-the-art open-source judges do not generalize to our proposed
criteria, despite strong correlation with GPT-4's judgement. Our findings
suggest that correlation with GPT-4 is an incomplete proxy for the practical
performance of judge models and should be supplemented with evaluations on unit
tests for precise failure mode detection.
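
To make the distinction concrete, the following sketch contrasts the two signals discussed above: rank correlation with GPT-4's scores versus pass rate on calibration unit tests. The helper names and the numbers in the comments are illustrative, not results from the paper.

```python
# A judge can correlate strongly with GPT-4 on average yet still fail
# specific calibration tests, so both signals are worth measuring.

from scipy.stats import spearmanr

def correlation_with_gpt4(judge_scores: list[float], gpt4_scores: list[float]) -> float:
    """Spearman rank correlation between a judge's scores and GPT-4's."""
    rho, _ = spearmanr(judge_scores, gpt4_scores)
    return rho

def unit_test_pass_rate(results: list[bool]) -> float:
    """Fraction of GroUSE-style unit tests the judge passes."""
    return sum(results) / len(results)

# Hypothetical outcome illustrating the gap the abstract describes:
#   correlation_with_gpt4(...) -> 0.9  (close agreement in ranking)
#   unit_test_pass_rate(...)   -> 0.6  (40% of failure modes missed)
```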
We further show that finetuning Llama-3 on GPT-4's reasoning traces
significantly boosts its evaluation capabilities, improving upon both
correlation with GPT-4's evaluations and calibration on reference situations.
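
A minimal sketch of how such finetuning data could be assembled from GPT-4 reasoning traces, assuming a simple prompt/completion format where the student (e.g. Llama-3) learns to reproduce both the rationale and the final score; the field names, prompt wording, and scoring scale are hypothetical.

```python
# Hedged sketch: building supervised finetuning examples from GPT-4
# reasoning traces. Targets include the rationale, not just the score,
# so the student learns *why* a failure mode warrants a given score.

import json

def make_sft_example(question, references, answer, gpt4_rationale, gpt4_score):
    """Format one (prompt, completion) pair for instruction finetuning."""
    prompt = (
        "Evaluate the grounded answer below.\n"
        f"Question: {question}\n"
        f"References: {references}\n"
        f"Answer: {answer}\n"
        "Explain your reasoning, then give a score from 1 to 5."
    )
    completion = f"{gpt4_rationale}\nScore: {gpt4_score}"
    return {"prompt": prompt, "completion": completion}

# Illustrative usage: write one training example as a JSONL record.
with open("judge_sft_data.jsonl", "w") as f:
    example = make_sft_example(
        question="What does GroUSE measure?",
        references=["GroUSE is a meta-evaluation benchmark of 144 unit tests."],
        answer="GroUSE measures evaluator calibration and discrimination.",
        gpt4_rationale="The answer is fully supported by the reference...",
        gpt4_score=5,
    )
    f.write(json.dumps(example) + "\n")
```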