RepLiQA: A Question-Answering Dataset for Benchmarking LLMs on Unseen Reference Content
June 17, 2024
Authors: Joao Monteiro, Pierre-Andre Noel, Etienne Marcotte, Sai Rajeswar, Valentina Zantedeschi, David Vazquez, Nicolas Chapados, Christopher Pal, Perouz Taslakian
cs.AI
Abstract
Large Language Models (LLMs) are trained on vast amounts of data, most of
which is automatically scraped from the internet. This data includes
encyclopedic documents that harbor a vast amount of general knowledge (e.g.,
Wikipedia) but also potentially overlap with benchmark datasets used for
evaluating LLMs. Consequently, evaluating models on test splits that might have
leaked into the training set is prone to misleading conclusions. To foster
sound evaluation of language models, we introduce a new test dataset named
RepLiQA, suited for question-answering and topic retrieval tasks. RepLiQA is a
collection of five splits of test sets, four of which have not been released to
the internet or exposed to LLM APIs prior to this publication. Each sample in
RepLiQA comprises (1) a reference document crafted by a human annotator and
depicting an imaginary scenario (e.g., a news article) absent from the
internet; (2) a question about the document's topic; (3) a ground-truth answer
derived directly from the information in the document; and (4) the paragraph
extracted from the reference document containing the answer. As such, accurate
answers can only be generated if a model can find relevant content within the
provided document. We run a large-scale benchmark comprising several
state-of-the-art LLMs to uncover differences in performance across models of
various types and sizes in a context-conditional language modeling setting.
Released splits of RepLiQA can be found here:
https://huggingface.co/datasets/ServiceNow/repliqa.
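As a rough illustration of how one might access the released splits, below is a minimal sketch using the Hugging Face `datasets` library; the exact split and column names are not stated in the abstract, so the snippet inspects them at runtime rather than assuming them.

```python
# Minimal sketch: load the released RepLiQA splits from the Hugging Face Hub.
# Split and column names are not listed in the abstract, so they are inspected
# at runtime instead of being hard-coded.
from datasets import load_dataset

dataset = load_dataset("ServiceNow/repliqa")

# Show which test splits are currently released on the Hub.
print(dataset)

# Peek at one sample: per the paper, it should contain a human-written
# reference document, a question, a ground-truth answer, and the
# answer-bearing paragraph.
first_split = next(iter(dataset))
print(dataset[first_split][0])
```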