When Benchmarks Age: Temporal Misalignment through Large Language Model Factuality Evaluation

Abstract

The rapid evolution of large language models (LLMs) and of the real world has outpaced the static nature of widely used evaluation benchmarks, raising concerns about how reliably those benchmarks evaluate LLM factuality. Although a substantial body of work continues to rely on popular but old benchmarks, their temporal misalignment with real-world facts and with modern LLMs, and the effect of that misalignment on LLM factuality evaluation, remain underexplored. In this work, we therefore present a systematic investigation of the issue, examining five popular factuality benchmarks and eight LLMs released across different years. We design an up-to-date fact retrieval pipeline and three metrics to quantify benchmark aging and its impact on LLM factuality evaluation. Experimental results and analysis show that a considerable portion of the samples in widely used factuality benchmarks are outdated, leading to unreliable assessments of LLM factuality. We hope our work can provide a testbed for assessing the reliability of a benchmark for LLM factuality evaluation and inspire more research on the benchmark aging issue. Code is available at https://github.com/JiangXunyi/BenchAge.