When Benchmarks Age: Temporal Misalignment through Large Language Model Factuality Evaluation

Abstract

The rapid evolution of large language models (LLMs) and of the real world has outpaced the static nature of widely used evaluation benchmarks, raising concerns about how reliably those benchmarks measure LLM factuality. Many studies still rely on popular but old benchmarks, yet the temporal misalignment of these benchmarks with real-world facts and with modern LLMs, and its effect on LLM factuality evaluation, remain underexplored. In this work, we systematically investigate this issue by examining five popular factuality benchmarks and eight LLMs released in different years. We design an up-to-date fact retrieval pipeline and three metrics to quantify benchmark aging and its impact on LLM factuality evaluation. Experimental results and analysis show that a considerable portion of the samples in widely used factuality benchmarks are outdated, which leads to unreliable assessments of LLM factuality. We hope our work provides a testbed for assessing the reliability of a benchmark for LLM factuality evaluation and inspires further research on the benchmark aging issue. Code is available at https://github.com/JiangXunyi/BenchAge.
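
To make the notion of an "outdated sample" concrete, the following minimal sketch computes the share of benchmark samples whose stored gold answer no longer matches an up-to-date answer. It is an illustration only and is not claimed to be one of the three metrics used in this work. The `Sample` fields, the `normalize` helper, and the exact-match comparison are all assumptions introduced here, and the answers are placeholders rather than real benchmark data.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One benchmark item (hypothetical structure, for illustration only)."""
    question: str
    gold_answer: str     # answer stored in the static benchmark
    current_answer: str  # answer assumed to come from an up-to-date source


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def outdated_fraction(samples: list[Sample]) -> float:
    """Fraction of samples whose gold answer differs from the current answer.

    Exact match after normalization is a simplifying assumption; a real
    pipeline would likely need a more tolerant comparison.
    """
    if not samples:
        return 0.0
    outdated = sum(
        normalize(s.gold_answer) != normalize(s.current_answer) for s in samples
    )
    return outdated / len(samples)


# Placeholder data: two of the four gold answers disagree with the current one.
samples = [
    Sample("Q1", "Answer A", "answer a"),
    Sample("Q2", "Answer B", "Answer C"),
    Sample("Q3", "Answer D", "Answer D"),
    Sample("Q4", "Answer E", "Answer F"),
]

print(outdated_fraction(samples))
```

A model scored against the stale gold answers in such a set would be penalized for giving the current answer, which is the kind of unreliable assessment described above.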