How Good Can Linear Models Get at Time-Series Forecasting?

Abstract

Time-series forecasting research has been moving steadily toward larger architectures, from specialized transformers to general-purpose foundation models, on the assumption that capacity is what unlocks accuracy. We take the opposite position: most of the gap can be closed at far lower cost by tuning preprocessing rather than scaling models. We use Ridge regression as the testbed, since it has a closed-form solution and interpretable weights, which let the optimal hyperparameters be read off the search directly. We search over context length, local normalization, regularization, and augmentation on eight standard benchmarks and find three patterns. (1) Optimal lookback is strongly series-specific and often non-monotonic in forecast horizon, with fitted power-law exponents ranging from +0.46 on ETTm2 to -0.19 on Exchange and Traffic, challenging the convention that longer horizons need longer history. (2) Normalizing over a learned trailing fraction of the context, rather than its entirety, is almost universally preferred. (3) Series within the same dataset often disagree on hyperparameters; the optimal degree of cross-series sharing varies from fully shared to fully per-series. The resulting models beat prior linear forecasters on most dataset-horizon entries and exceed Transformer, MLP, and CNN baselines on six of eight benchmarks. The optimized hyperparameters also serve as a diagnostic on the data itself, revealing structures that larger models absorb silently into their learned parameters.
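
The ingredients named above (closed-form ridge regression, normalization over a trailing fraction of the context, and a power-law fit of optimal lookback against horizon) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the per-window mean/std normalization, the absence of an intercept, and the log-log least-squares fit for the exponent are all assumptions made here for illustration; the abstract does not specify any of them.

```python
import numpy as np

def make_windows(series, lookback, horizon):
    """Slice a 1-D series into (context, target) training pairs."""
    n = len(series) - lookback - horizon + 1
    X = np.stack([series[i:i + lookback] for i in range(n)])
    Y = np.stack([series[i + lookback:i + lookback + horizon] for i in range(n)])
    return X, Y

def trailing_stats(X, frac):
    """Mean and std over the last `frac` of each context window (assumed form)."""
    k = max(1, int(round(frac * X.shape[1])))
    tail = X[:, -k:]
    mu = tail.mean(axis=1, keepdims=True)
    sigma = tail.std(axis=1, keepdims=True) + 1e-8  # avoid division by zero
    return mu, sigma

def fit(series, lookback, horizon, frac, lam):
    """Closed-form ridge on locally normalized windows:
    W = (X^T X + lam * I)^{-1} X^T Y, with no intercept (an assumption)."""
    X, Y = make_windows(series, lookback, horizon)
    mu, sigma = trailing_stats(X, frac)
    Xn, Yn = (X - mu) / sigma, (Y - mu) / sigma
    return np.linalg.solve(Xn.T @ Xn + lam * np.eye(lookback), Xn.T @ Yn)

def predict(W, context, frac):
    """Normalize the context, apply the linear map, undo the normalization."""
    x = np.asarray(context, dtype=float)[None, :]
    mu, sigma = trailing_stats(x, frac)
    return (((x - mu) / sigma) @ W * sigma + mu)[0]

def power_law_exponent(horizons, best_lookbacks):
    """Slope of log(lookback) against log(horizon), i.e. alpha in L ~ H^alpha.
    Log-log least squares is an assumed fitting procedure."""
    return np.polyfit(np.log(horizons), np.log(best_lookbacks), 1)[0]

# Toy usage on a synthetic, noiseless series (not one of the paper's benchmarks).
t = np.arange(600)
series = 10 + np.sin(2 * np.pi * t / 24)
W = fit(series[:500], lookback=48, horizon=12, frac=0.5, lam=1e-3)
forecast = predict(W, series[500:548], frac=0.5)
print("max abs error:", np.max(np.abs(forecast - series[548:560])))
```

In a search of the kind the abstract describes, `lookback`, `frac`, and `lam` would be selected on validation error per dataset, per horizon, or per series. Because each fit is a single linear solve, such a grid is cheap to evaluate.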