
When Benchmarks Age: Temporal Misalignment through Large Language Model Factuality Evaluation

October 8, 2025
Authors: Xunyi Jiang, Dingyi Chang, Julian McAuley, Xin Xu
cs.AI

Abstract

The rapid evolution of large language models (LLMs) and the real world has outpaced the static nature of widely used evaluation benchmarks, raising concerns about their reliability for evaluating LLM factuality. Although a substantial body of work continues to rely on popular but aging benchmarks, their temporal misalignment with real-world facts and modern LLMs, and its effects on LLM factuality evaluation, remain underexplored. In this work, we therefore present a systematic investigation of this issue, examining five popular factuality benchmarks and eight LLMs released across different years. We tailor an up-to-date fact retrieval pipeline and three metrics to quantify benchmark aging and its impact on LLM factuality evaluation. Experimental results and analysis show that a considerable portion of the samples in widely used factuality benchmarks are outdated, leading to unreliable assessments of LLM factuality. We hope our work provides a testbed for assessing the reliability of benchmarks for LLM factuality evaluation and inspires further research on the benchmark aging issue. Code is available at https://github.com/JiangXunyi/BenchAge.
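The abstract does not spell out the retrieval pipeline or the three metrics. As a rough illustration of the general idea, one simple aging measure is the fraction of benchmark gold answers that no longer agree with an up-to-date retrieved fact. In the Python sketch below, `retrieve_current_fact` and `aging_rate` are hypothetical names introduced for illustration only, not the authors' actual implementation:

```python
from typing import Callable, Iterable, Tuple

def aging_rate(
    samples: Iterable[Tuple[str, str]],
    retrieve_current_fact: Callable[[str], str],
) -> float:
    """Illustrative aging metric (not the paper's definition):
    the fraction of (question, gold_answer) pairs whose gold answer
    no longer matches an up-to-date fact retrieved for the question.
    """
    pairs = list(samples)
    if not pairs:
        return 0.0
    # Count samples whose stored gold answer disagrees with the
    # freshly retrieved fact, using a naive normalized string match.
    outdated = sum(
        1
        for question, gold_answer in pairs
        if retrieve_current_fact(question).strip().lower()
        != gold_answer.strip().lower()
    )
    return outdated / len(pairs)
```

In a realistic setup, the exact-match comparison above would presumably be replaced by a more robust factual-consistency check (e.g., entailment against the retrieved evidence), since up-to-date facts rarely match stale gold answers verbatim.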