Why Do Transformers Fail to Forecast Time Series In-Context?
October 10, 2025
Authors: Yufa Zhou, Yixiao Wang, Surbhi Goel, Anru R. Zhang
cs.AI
Abstract
Time series forecasting (TSF) remains a challenging and largely unsolved
problem in machine learning, despite significant recent efforts leveraging
Large Language Models (LLMs), which predominantly rely on Transformer
architectures. Empirical evidence consistently shows that even powerful
Transformers often fail to outperform much simpler models, e.g., linear models,
on TSF tasks; however, a rigorous theoretical understanding of this phenomenon
remains limited. In this paper, we provide a theoretical analysis of
Transformers' limitations for TSF through the lens of In-Context Learning (ICL)
theory. Specifically, under AR(p) data, we establish that: (1) Linear
Self-Attention (LSA) models cannot achieve lower expected MSE than
classical linear models for in-context forecasting; (2) as the context length
approaches infinity, LSA asymptotically recovers the optimal linear
predictor; and (3) under Chain-of-Thought (CoT) style inference, predictions
collapse exponentially fast to the mean. We empirically validate these findings
through carefully designed experiments. Our theory not only sheds light on
several previously underexplored phenomena but also offers practical insights
for designing more effective forecasting architectures. We hope our work
encourages the broader research community to revisit the fundamental
theoretical limitations of TSF and to critically evaluate the direct
application of increasingly sophisticated architectures without deeper
scrutiny.
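
To make the setting concrete, here is a minimal worked sketch (our own illustration based on the abstract, not an equation taken from the paper): a zero-mean AR(p) process, and the AR(1) special case showing why iterated, CoT-style rollouts of a one-step predictor decay exponentially toward the mean.

% Illustrative sketch (not from the paper): AR(p) data and a CoT-style rollout.
% A zero-mean AR(p) process with i.i.d. zero-mean noise \varepsilon_t:
\[
  x_t = \sum_{i=1}^{p} a_i \, x_{t-i} + \varepsilon_t,
  \qquad \mathbb{E}[\varepsilon_t] = 0 .
\]
% In the AR(1) special case with |a| < 1, feeding each prediction back as input
% (a Chain-of-Thought-style rollout of the one-step predictor) gives the
% k-step-ahead forecast
\[
  \hat{x}_{t+k} = a^{k} \, x_t \;\xrightarrow[k \to \infty]{}\; 0 ,
\]
% i.e. the forecast shrinks to the process mean at a geometric (exponential) rate,
% matching the qualitative "collapse to the mean" behavior described in finding (3).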