When Benchmarks Age: Temporal Misalignment through Large Language Model Factuality Evaluation

October 8, 2025
Authors: Xunyi Jiang, Dingyi Chang, Julian McAuley, Xin Xu
cs.AI

Abstract

The rapid evolution of large language models (LLMs) and of the real world has outpaced the static nature of widely used evaluation benchmarks, raising concerns about their reliability for evaluating LLM factuality. While a substantial body of work continues to rely on popular but aging benchmarks, their temporal misalignment with real-world facts and modern LLMs, and its effect on LLM factuality evaluation, remain underexplored. In this work, we therefore present a systematic investigation of this issue, examining five popular factuality benchmarks and eight LLMs released in different years. We tailor an up-to-date fact retrieval pipeline and three metrics to quantify benchmark aging and its impact on LLM factuality evaluation. Our experiments show that a considerable portion of the samples in these widely used factuality benchmarks are outdated, leading to unreliable assessments of LLM factuality. We hope our work provides a testbed for assessing the reliability of benchmarks for LLM factuality evaluation and inspires further research on the benchmark aging problem. Code is available at https://github.com/JiangXunyi/BenchAge.
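
The paper's actual pipeline and metric definitions live in the linked repository; as a rough illustration of the underlying idea, the minimal sketch below computes one simple aging signal: the fraction of benchmark samples whose recorded gold answer disagrees with a freshly retrieved, up-to-date fact. All names here (`Sample`, `outdated_fraction`, `fetch_current_fact`) are hypothetical and not the paper's API.

```python
# Hypothetical sketch of one benchmark-aging signal; not the BenchAge code.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Sample:
    question: str
    gold_answer: str  # answer recorded when the benchmark was built

def outdated_fraction(
    samples: list[Sample],
    fetch_current_fact: Callable[[str], str],
) -> float:
    """Share of samples whose stored gold answer no longer matches a
    freshly retrieved fact. A real pipeline would use a stronger matcher
    (normalization, alias lists, or an LLM judge) than exact comparison."""
    if not samples:
        return 0.0
    stale = sum(
        fetch_current_fact(s.question).strip().lower()
        != s.gold_answer.strip().lower()
        for s in samples
    )
    return stale / len(samples)

# Toy usage with a dictionary standing in for the retrieval backend.
facts = {
    "Who is the UK prime minister?": "Keir Starmer",
    "What is the capital of France?": "Paris",
}
bench = [
    Sample("Who is the UK prime minister?", "Boris Johnson"),  # answer has changed
    Sample("What is the capital of France?", "Paris"),         # still correct
]
print(outdated_fraction(bench, facts.__getitem__))  # -> 0.5
```

In practice, the retrieval callable would be backed by a live search or knowledge source rather than a static dictionary, and the paper pairs such retrieval with three dedicated metrics rather than this single staleness ratio.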