Pitfalls in Evaluating Language Model Forecasters

Abstract

Large language models (LLMs) have recently been applied to forecasting tasks, and some studies claim that these systems match or exceed human performance. In this paper, we argue that, as a community, we should be cautious about such conclusions, because evaluating LLM forecasters presents unique challenges. We identify two broad categories of issues: (1) difficulty in trusting evaluation results due to many forms of temporal leakage, and (2) difficulty in extrapolating from evaluation performance to real-world forecasting. Through systematic analysis and concrete examples from prior work, we demonstrate how evaluation flaws can raise concerns about current and future performance claims. We argue that more rigorous evaluation methodologies are needed to confidently assess the forecasting abilities of LLMs.
