RepLiQA: A Question-Answering Dataset for Benchmarking LLMs on Unseen Reference Content
June 17, 2024
Authors: Joao Monteiro, Pierre-Andre Noel, Etienne Marcotte, Sai Rajeswar, Valentina Zantedeschi, David Vazquez, Nicolas Chapados, Christopher Pal, Perouz Taslakian
cs.AI
Abstract
Large Language Models (LLMs) are trained on vast amounts of data, most of
which is automatically scraped from the internet. This data includes
encyclopedic documents that harbor a vast amount of general knowledge (e.g.,
Wikipedia) but also potentially overlap with benchmark datasets used for
evaluating LLMs. Consequently, evaluating models on test splits that might have
leaked into the training set is prone to misleading conclusions. To foster
sound evaluation of language models, we introduce a new test dataset named
RepLiQA, suited for question-answering and topic retrieval tasks. RepLiQA is a
collection of five splits of test sets, four of which have not been released to
the internet or exposed to LLM APIs prior to this publication. Each sample in
RepLiQA comprises (1) a reference document crafted by a human annotator and
depicting an imaginary scenario (e.g., a news article) absent from the
internet; (2) a question about the document's topic; (3) a ground-truth answer
derived directly from the information in the document; and (4) the paragraph
extracted from the reference document containing the answer. As such, accurate
answers can only be generated if a model can find relevant content within the
provided document. We run a large-scale benchmark comprising several
state-of-the-art LLMs to uncover differences in performance across models of
various types and sizes in a context-conditional language modeling setting.
Released splits of RepLiQA can be found here:
https://huggingface.co/datasets/ServiceNow/repliqa.
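As a rough illustration of how one might access the released splits, below is a minimal sketch using the Hugging Face `datasets` library; the exact split and column names are not stated in the abstract, so the snippet inspects them at runtime rather than assuming them.

```python
# Minimal sketch: load the released RepLiQA splits from the Hugging Face Hub.
# Split and column names are not listed in the abstract, so they are inspected
# at runtime instead of being hard-coded.
from datasets import load_dataset

dataset = load_dataset("ServiceNow/repliqa")

# Show which test splits are currently released on the Hub.
print(dataset)

# Peek at one sample: per the paper, it should contain a human-written
# reference document, a question, a ground-truth answer, and the
# answer-bearing paragraph.
first_split = next(iter(dataset))
print(dataset[first_split][0])
```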