VisionTS: Visual Masked Autoencoders Are Free-Lunch Zero-Shot Time Series Forecasters

Abstract

Foundation models have emerged as a promising approach to time series forecasting (TSF). Existing approaches either fine-tune large language models (LLMs) or build large-scale time series datasets to develop TSF foundation models. However, these methods are hampered by a severe cross-domain gap or by in-domain heterogeneity. In this paper, we explore a new route to building a TSF foundation model from rich, high-quality natural images, based on the intrinsic similarities between images and time series. To bridge the gap between the two domains, we reformulate the TSF task as an image reconstruction task, which is then handled by a visual masked autoencoder (MAE) pre-trained in a self-supervised manner on the ImageNet dataset. Surprisingly, without further adaptation to the time series domain, the proposed VisionTS achieves superior zero-shot forecasting performance compared with existing TSF foundation models. With minimal fine-tuning, VisionTS further improves its forecasts and achieves state-of-the-art performance in most cases. These findings suggest that visual models could be a free lunch for TSF and highlight the potential for future cross-domain research between computer vision and TSF. Our code is publicly available at https://github.com/Keytoyze/VisionTS.
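
The reformulation of forecasting as image reconstruction can be illustrated with a minimal sketch. The abstract does not spell out the pipeline, so everything here is an assumption made for illustration: the series is folded into a 2D grid with one column per period, the columns to be forecast are treated as masked, and a reconstructor fills them in. The names `fold`, `unfold`, `forecast`, and `repeat_last_column` are hypothetical, and `repeat_last_column` is only a trivial stand-in for the pretrained MAE, not the model itself. A real pipeline would presumably also normalize the values and resize the grid to the model's input resolution.

```python
from typing import Callable, List

# Rows index the position within a period; columns index successive periods.
Grid = List[List[float]]


def fold(series: List[float], period: int) -> Grid:
    """Stack consecutive periods of the series as columns of a 2D grid."""
    if period <= 0 or len(series) % period != 0:
        raise ValueError("series length must be a positive multiple of the period")
    n_cols = len(series) // period
    return [[series[c * period + r] for c in range(n_cols)] for r in range(period)]


def unfold(grid: Grid) -> List[float]:
    """Inverse of fold: read the grid back column by column."""
    n_rows, n_cols = len(grid), len(grid[0])
    return [grid[r][c] for c in range(n_cols) for r in range(n_rows)]


def repeat_last_column(visible: Grid, n_masked: int) -> Grid:
    """Hypothetical stand-in for the MAE: copy the last visible column
    into every masked column (a seasonal-naive reconstruction)."""
    return [[row[-1]] * n_masked for row in visible]


def forecast(
    history: List[float],
    period: int,
    horizon: int,
    reconstruct: Callable[[Grid, int], Grid] = repeat_last_column,
) -> List[float]:
    """Forecast `horizon` steps by reconstructing the masked right-hand columns."""
    if horizon % period != 0:
        raise ValueError("horizon must be a multiple of the period in this sketch")
    visible = fold(history, period)                     # left, visible part of the image
    masked = reconstruct(visible, horizon // period)    # right, reconstructed part
    return unfold(masked)


if __name__ == "__main__":
    history = [1.0, 2.0, 3.0, 4.0, 2.0, 3.0, 4.0, 5.0]  # two periods of length 4
    print(forecast(history, period=4, horizon=4))
```

Swapping `repeat_last_column` for a function that calls an image reconstruction model is where a pretrained MAE would plug in under these assumptions; the folding and unfolding steps stay the same.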