When Benchmarks Age: Temporal Misalignment through Large Language Model Factuality Evaluation
October 8, 2025
Authors: Xunyi Jiang, Dingyi Chang, Julian McAuley, Xin Xu
cs.AI
Abstract
The rapid evolution of large language models (LLMs) and the real world has outpaced the static nature of widely used evaluation benchmarks, raising concerns about their reliability for evaluating LLM factuality. While a substantial body of work continues to rely on popular but dated benchmarks, their temporal misalignment with real-world facts and modern LLMs, and its effects on LLM factuality evaluation, remain underexplored. Therefore, in this work, we present a systematic investigation of this issue by examining five popular factuality benchmarks and eight LLMs released across different years. An up-to-date fact retrieval pipeline and three metrics are tailored to quantify benchmark aging and its impact on LLM factuality evaluation. Experimental results and analysis show that a considerable portion of the samples in widely used factuality benchmarks are outdated, leading to unreliable assessments of LLM factuality. We hope our work can provide a testbed for assessing the reliability of a benchmark for LLM factuality evaluation and inspire more research on the benchmark aging issue. Code is available at https://github.com/JiangXunyi/BenchAge.
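
The abstract does not detail the three metrics, so the following is only a minimal illustrative sketch of how benchmark aging might be quantified: the fraction of benchmark gold answers that no longer agree with up-to-date retrieved facts. The function name and the retrieval callback are hypothetical stand-ins, not the paper's actual pipeline or metric definitions.

    # Illustrative sketch only; the paper's actual pipeline and metrics may differ.
    # `retrieve_current_answer` is a hypothetical stand-in for an up-to-date
    # fact retrieval step (e.g., querying a live knowledge source).
    from typing import Callable, Iterable, Tuple

    def outdated_sample_ratio(
        benchmark: Iterable[Tuple[str, str]],
        retrieve_current_answer: Callable[[str], str],
    ) -> float:
        """Fraction of benchmark samples whose gold answer no longer
        matches the current real-world fact (higher means more aged)."""
        total = 0
        outdated = 0
        for question, gold_answer in benchmark:
            total += 1
            current = retrieve_current_answer(question)
            # Naive string comparison; a real pipeline would need answer
            # normalization or semantic matching to judge factual agreement.
            if current.strip().lower() != gold_answer.strip().lower():
                outdated += 1
        return outdated / total if total else 0.0

Under this toy definition, a benchmark whose answers all still hold would score 0.0, while one whose facts have drifted entirely would score 1.0.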