RepLiQA: A Question-Answering Dataset for Benchmarking LLMs on Unseen Reference Content

Abstract

Large Language Models (LLMs) are trained on vast amounts of data, most of which is automatically scraped from the internet. This data includes encyclopedic documents that hold a large amount of general knowledge (e.g., Wikipedia), but it may also overlap with the benchmark datasets used to evaluate LLMs. Consequently, evaluating models on test splits that might have leaked into the training set can lead to misleading conclusions. To foster sound evaluation of language models, we introduce a new test dataset named RepLiQA, suited for question-answering and topic retrieval tasks. RepLiQA is a collection of five test-set splits, four of which have not been released to the internet or exposed to LLM APIs prior to this publication. Each sample in RepLiQA comprises (1) a reference document crafted by a human annotator and depicting an imaginary scenario (e.g., a news article) absent from the internet; (2) a question about the document's topic; (3) a ground-truth answer derived directly from the information in the document; and (4) the paragraph extracted from the reference document containing the answer. As such, accurate answers can only be generated if a model can find relevant content within the provided document. We run a large-scale benchmark comprising several state-of-the-art LLMs to uncover differences in performance across models of various types and sizes in a context-conditional language modeling setting. Released splits of RepLiQA can be found here: https://huggingface.co/datasets/ServiceNow/repliqa.
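The four-part sample structure and the context-conditional setting described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the class name `QASample`, its field names, the prompt wording, and the toy sample are hypothetical and are not taken from the released dataset, whose actual schema should be checked on the dataset page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QASample:
    """One sample with the four parts listed in the abstract.

    Field names are hypothetical, not the dataset's real column names.
    """
    document: str              # (1) reference document written by a human annotator
    question: str              # (2) question about the document's topic
    answer: str                # (3) ground-truth answer derived from the document
    supporting_paragraph: str  # (4) paragraph of the document containing the answer


def build_prompt(sample: QASample) -> str:
    """Build a context-conditional prompt: the document first, then the question."""
    return (
        "Answer the question using only the document below.\n\n"
        f"Document:\n{sample.document}\n\n"
        f"Question: {sample.question}\n"
        "Answer:"
    )


def paragraph_is_grounded(sample: QASample) -> bool:
    """Check that the supporting paragraph appears verbatim in the document."""
    return sample.supporting_paragraph.strip() in sample.document


# Made-up toy sample for illustration only; it is not part of RepLiQA.
sample = QASample(
    document=(
        "The town of Elmsworth opened a new library on Monday.\n\n"
        "Mayor Ada Quill cut the ribbon at noon."
    ),
    question="Who cut the ribbon at the library opening?",
    answer="Mayor Ada Quill",
    supporting_paragraph="Mayor Ada Quill cut the ribbon at noon.",
)

prompt = build_prompt(sample)
```

Because the reference document describes an imaginary scenario, a model given `prompt` has to locate the answer in the supplied text and cannot rely on memorized training data.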
