Pitfalls in Evaluating Language Model Forecasters
May 31, 2025
Authors: Daniel Paleka, Shashwat Goel, Jonas Geiping, Florian Tramèr
cs.AI
Abstract
Large language models (LLMs) have recently been applied to forecasting tasks,
with some works claiming these systems match or exceed human performance. In
this paper, we argue that, as a community, we should be careful about such
conclusions as evaluating LLM forecasters presents unique challenges. We
identify two broad categories of issues: (1) difficulty in trusting evaluation
results due to many forms of temporal leakage, and (2) difficulty in
extrapolating from evaluation performance to real-world forecasting. Through
systematic analysis and concrete examples from prior work, we demonstrate how
evaluation flaws can raise concerns about current and future performance
claims. We argue that more rigorous evaluation methodologies are needed to
confidently assess the forecasting abilities of LLMs.
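To make the temporal-leakage concern concrete, the minimal sketch below (not taken from the paper; the `Question` schema, field names, and cutoff-based filter are illustrative assumptions) shows one simple precaution an evaluator might take: scoring a model only on questions whose outcomes were still unknown at its training cutoff, so that resolved answers cannot already be present in the training data.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical question record; field names are illustrative, not the paper's schema.
@dataclass
class Question:
    text: str
    open_date: date      # when the question was posed
    resolve_date: date   # when the outcome became known

def leakage_free(questions: list[Question], training_cutoff: date) -> list[Question]:
    """Keep only questions that resolve strictly after the model's training cutoff.

    Questions resolved before the cutoff may have their outcomes (or strong hints)
    baked into the training data, so scoring them overstates forecasting skill.
    """
    return [q for q in questions if q.resolve_date > training_cutoff]

# Example: a model with a June 2024 cutoff should not be scored on questions
# that were already resolved by then.
questions = [
    Question("Will X happen by 2024-03-01?", date(2023, 12, 1), date(2024, 3, 1)),
    Question("Will Y happen by 2025-01-01?", date(2024, 7, 1), date(2025, 1, 1)),
]
print([q.text for q in leakage_free(questions, training_cutoff=date(2024, 6, 1))])
# -> only the second question survives
```

This filter addresses only one form of leakage (answers known before the training cutoff); the paper argues that many other forms, such as leakage through retrieval or prompt construction, also need to be controlled.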