RepLiQA: A Question-Answering Dataset for Benchmarking LLMs on Unseen Reference Content

Abstract

Large Language Models (LLMs) are trained on vast amounts of data, most of which is automatically scraped from the internet. This data includes encyclopedic documents that hold a large amount of general knowledge (e.g., Wikipedia), but it may also overlap with the benchmark datasets used for evaluating LLMs. Consequently, evaluating models on test splits that might have leaked into the training set can lead to misleading conclusions. To foster sound evaluation of language models, we introduce a new test dataset named RepLiQA, suited for question-answering and topic retrieval tasks. RepLiQA is a collection of five test set splits, four of which have not been released to the internet or exposed to LLM APIs prior to this publication. Each sample in RepLiQA comprises (1) a reference document crafted by a human annotator and depicting an imaginary scenario (e.g., a news article) absent from the internet; (2) a question about the document's topic; (3) a ground-truth answer derived directly from the information in the document; and (4) the paragraph extracted from the reference document containing the answer. As such, accurate answers can only be generated if a model can find relevant content within the provided document. We run a large-scale benchmark comprising several state-of-the-art LLMs to uncover differences in performance across models of various types and sizes in a context-conditional language modeling setting. Released splits of RepLiQA can be found here: https://huggingface.co/datasets/ServiceNow/repliqa.
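To make the four-part sample structure concrete, the following is a minimal Python sketch of how one sample and a context-conditional prompt could be represented. The class name, field names, helper functions, and the toy example text are all assumptions made for illustration; they are not the dataset's actual column names or the authors' evaluation code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QASample:
    """One sample with the four components listed in the abstract.

    Field names are hypothetical, not the dataset's actual column names.
    """
    document: str   # (1) human-written reference document
    question: str   # (2) question about the document's topic
    answer: str     # (3) ground-truth answer derived from the document
    paragraph: str  # (4) paragraph of the document that contains the answer


def build_prompt(sample: QASample) -> str:
    """Format a prompt in which the model sees the document before the question."""
    return (
        "Answer the question using only the document below.\n\n"
        f"Document:\n{sample.document}\n\n"
        f"Question: {sample.question}\n"
        "Answer:"
    )


def paragraph_in_document(sample: QASample) -> bool:
    """Check that the supporting paragraph appears verbatim in the document."""
    return sample.paragraph in sample.document


# Invented toy sample, for illustration only (not taken from RepLiQA).
example = QASample(
    document=(
        "The town of Veloria opened its first tram line on Monday. "
        "Mayor Ana Ruiz cut the ribbon."
    ),
    question="Who cut the ribbon at the tram line opening?",
    answer="Mayor Ana Ruiz",
    paragraph="Mayor Ana Ruiz cut the ribbon.",
)
```

Calling build_prompt(example) yields a prompt that ends with "Answer:", so a model's completion can be compared against example.answer. Because the document describes an imaginary scenario, a correct completion has to come from the supplied text rather than from memorized training data.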
