QuitoBench: A High-Quality Open Time Series Forecasting Benchmark

Abstract

Time series forecasting is critical across finance, healthcare, and cloud computing, yet progress is constrained by a fundamental bottleneck: the scarcity of large-scale, high-quality benchmarks. To address this gap, we introduce QuitoBench, a regime-balanced benchmark for time series forecasting with coverage across eight trend × seasonality × forecastability (TSF) regimes, designed to capture forecasting-relevant properties rather than application-defined domain labels. The benchmark is built upon Quito, a billion-scale time series corpus of application traffic from Alipay spanning nine business domains. Benchmarking 10 models from deep learning, foundation models, and statistical baselines across 232,200 evaluation instances, we report four key findings: (i) a context-length crossover where deep learning models lead at short context (L = 96) but foundation models dominate at long context (L ≥ 576); (ii) forecastability is the dominant difficulty driver, producing a 3.64× MAE gap across regimes; (iii) deep learning models match or surpass foundation models at 59× fewer parameters; and (iv) scaling the amount of training data provides substantially greater benefit than scaling model size for both model families. These findings are validated by strong cross-benchmark and cross-metric consistency. Our open-source release enables reproducible, regime-aware evaluation for time series forecasting research.
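
To make the regime-level comparison concrete, the following is a minimal sketch of how a per-regime MAE and a cross-regime gap might be computed. It rests on several assumptions that the abstract does not confirm: that each of the three properties is split into two levels (giving 2 × 2 × 2 = 8 regimes), that the gap is the ratio of the largest to the smallest per-regime MAE, and that MAE is averaged per instance without normalization. The names `REGIMES`, `mae_by_regime`, and `regime_gap` are hypothetical, and the toy data is invented for illustration only.

```python
from collections import defaultdict
from itertools import product

# Assumption: each property has two levels, so there are 2 * 2 * 2 = 8 regimes.
# The level labels are hypothetical, not taken from the benchmark.
LEVELS = ("low", "high")
REGIMES = list(product(LEVELS, repeat=3))  # (trend, seasonality, forecastability)


def mae(y_true, y_pred):
    """Mean absolute error over one forecast horizon."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mae_by_regime(instances):
    """Average per-instance MAE within each regime.

    `instances` is an iterable of (regime, y_true, y_pred) tuples.
    """
    per_regime = defaultdict(list)
    for regime, y_true, y_pred in instances:
        per_regime[regime].append(mae(y_true, y_pred))
    return {r: sum(v) / len(v) for r, v in per_regime.items()}


def regime_gap(regime_mae):
    """Ratio of the hardest regime's MAE to the easiest regime's MAE."""
    return max(regime_mae.values()) / min(regime_mae.values())


# Toy data, invented for illustration only (not results from the benchmark).
toy = [
    (("high", "high", "high"), [1.0, 2.0, 3.0], [1.5, 2.5, 3.5]),
    (("high", "high", "low"), [1.0, 2.0, 3.0], [2.0, 3.0, 4.0]),
]
scores = mae_by_regime(toy)
gap = regime_gap(scores)
```

Grouping errors by regime before averaging keeps a regime with many instances from dominating the aggregate, which is one way to read the "regime-balanced" design described above.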