
RepLiQA: A Question-Answering Dataset for Benchmarking LLMs on Unseen Reference Content

June 17, 2024
作者: Joao Monteiro, Pierre-Andre Noel, Etienne Marcotte, Sai Rajeswar, Valentina Zantedeschi, David Vazquez, Nicolas Chapados, Christopher Pal, Perouz Taslakian
cs.AI

Abstract

Large Language Models (LLMs) are trained on vast amounts of data, most of which is automatically scraped from the internet. This data includes encyclopedic documents that harbor a vast amount of general knowledge (e.g., Wikipedia) but also potentially overlap with benchmark datasets used for evaluating LLMs. Consequently, evaluating models on test splits that might have leaked into the training set is prone to misleading conclusions. To foster sound evaluation of language models, we introduce a new test dataset named RepLiQA, suited for question-answering and topic retrieval tasks. RepLiQA is a collection of five splits of test sets, four of which have not been released to the internet or exposed to LLM APIs prior to this publication. Each sample in RepLiQA comprises (1) a reference document crafted by a human annotator and depicting an imaginary scenario (e.g., a news article) absent from the internet; (2) a question about the document's topic; (3) a ground-truth answer derived directly from the information in the document; and (4) the paragraph extracted from the reference document containing the answer. As such, accurate answers can only be generated if a model can find relevant content within the provided document. We run a large-scale benchmark comprising several state-of-the-art LLMs to uncover differences in performance across models of various types and sizes in a context-conditional language modeling setting. Released splits of RepLiQA can be found here: https://huggingface.co/datasets/ServiceNow/repliqa.
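
Since the released splits are hosted on the Hugging Face Hub, a minimal sketch of how one might load a split and build a context-conditional prompt follows. The split name (`repliqa_0`) and field names (`document_extracted`, `question`, `answer`) are assumptions inferred from the abstract's description of each sample, not confirmed from the dataset card.

```python
# Hypothetical sketch: load one RepLiQA split and construct a
# context-conditional QA prompt. Split and field names are assumptions
# based on the abstract, not verified against the dataset card.
from datasets import load_dataset

dataset = load_dataset("ServiceNow/repliqa", split="repliqa_0")
sample = dataset[0]

# Condition the model on the full reference document, matching the paper's
# context-conditional setting: accurate answers should only be derivable
# from content found within the provided document.
prompt = (
    "Answer the question using only the document below.\n\n"
    f"Document:\n{sample['document_extracted']}\n\n"
    f"Question: {sample['question']}\n"
    "Answer:"
)

print(prompt[:500])                # inspect the constructed prompt
print("Gold:", sample["answer"])   # ground-truth answer for scoring
```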
