
Pitfalls in Evaluating Language Model Forecasters

May 31, 2025
作者: Daniel Paleka, Shashwat Goel, Jonas Geiping, Florian Tramèr
cs.AI

Abstract

Large language models (LLMs) have recently been applied to forecasting tasks, with some works claiming these systems match or exceed human performance. In this paper, we argue that, as a community, we should be careful about such conclusions as evaluating LLM forecasters presents unique challenges. We identify two broad categories of issues: (1) difficulty in trusting evaluation results due to many forms of temporal leakage, and (2) difficulty in extrapolating from evaluation performance to real-world forecasting. Through systematic analysis and concrete examples from prior work, we demonstrate how evaluation flaws can raise concerns about current and future performance claims. We argue that more rigorous evaluation methodologies are needed to confidently assess the forecasting abilities of LLMs.