QuitoBench: A High-Quality Open Time Series Forecasting Benchmark
March 27, 2026
Authors: Siqiao Xue, Zhaoyang Zhu, Wei Zhang, Rongyao Cai, Rui Wang, Yixiang Mu, Fan Zhou, Jianguo Li, Peng Di, Hang Yu
cs.AI
Abstract
Time series forecasting is critical across finance, healthcare, and cloud computing, yet progress is constrained by a fundamental bottleneck: the scarcity of large-scale, high-quality benchmarks. To address this gap, we introduce QuitoBench, a regime-balanced benchmark for time series forecasting with coverage across eight trend × seasonality × forecastability (TSF) regimes, designed to capture forecasting-relevant properties rather than application-defined domain labels. The benchmark is built upon Quito, a billion-scale time series corpus of application traffic from Alipay spanning nine business domains. Benchmarking 10 models from deep learning, foundation models, and statistical baselines across 232,200 evaluation instances, we report four key findings: (i) a context-length crossover where deep learning models lead at short context (L = 96) but foundation models dominate at long context (L ≥ 576); (ii) forecastability is the dominant difficulty driver, producing a 3.64× MAE gap across regimes; (iii) deep learning models match or surpass foundation models at 59× fewer parameters; and (iv) scaling the amount of training data provides substantially greater benefit than scaling model size for both model families. These findings are validated by strong cross-benchmark and cross-metric consistency. Our open-source release enables reproducible, regime-aware evaluation for time series forecasting research.
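As a rough illustration of the regime structure and the MAE gap mentioned in the abstract, the sketch below enumerates eight regimes and computes a per-regime error ratio. It is not the benchmark's code. It assumes each of the three axes is split into two levels (which yields 2³ = 8 combinations), and it assumes the "gap" is the ratio of the largest to the smallest per-regime MAE; the abstract confirms neither. The names `tsf_regimes`, `mae`, and `mae_gap` are hypothetical.

```python
from itertools import product

# Assumption: each TSF axis has two levels, so there are 2**3 = 8 regimes.
AXES = ("trend", "seasonality", "forecastability")
LEVELS = ("low", "high")


def tsf_regimes():
    """Return every combination of levels across the three axes."""
    return [dict(zip(AXES, combo)) for combo in product(LEVELS, repeat=len(AXES))]


def mae(y_true, y_pred):
    """Mean absolute error between two equal-length, non-empty sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mae_gap(per_regime_mae):
    """Ratio of the largest to the smallest per-regime MAE (assumed definition)."""
    values = list(per_regime_mae.values())
    return max(values) / min(values)


if __name__ == "__main__":
    for regime in tsf_regimes():
        print(regime)
```

Under these assumptions, a reported 3.64× gap would mean that the hardest regime has an MAE 3.64 times that of the easiest one.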