Time-MoE: Billion-Scale Time Series Foundation Models with Mixture of Experts

September 24, 2024
Authors: Xiaoming Shi, Shiyu Wang, Yuqi Nie, Dianqi Li, Zhou Ye, Qingsong Wen, Ming Jin
cs.AI

Abstract

Deep learning for time series forecasting has seen significant advancements over the past decades. However, despite the success of large-scale pre-training in language and vision domains, pre-trained time series models remain limited in scale and operate at a high cost, hindering the development of larger, more capable forecasting models in real-world applications. In response, we introduce Time-MoE, a scalable and unified architecture designed to pre-train larger, more capable forecasting foundation models while reducing inference costs. By leveraging a sparse mixture-of-experts (MoE) design, Time-MoE enhances computational efficiency by activating only a subset of networks for each prediction, reducing computational load while maintaining high model capacity. This allows Time-MoE to scale effectively without a corresponding increase in inference costs. Time-MoE comprises a family of decoder-only transformer models that operate in an auto-regressive manner and support flexible forecasting horizons with varying input context lengths. We pre-trained these models on our newly introduced large-scale dataset Time-300B, which spans 9 domains and encompasses over 300 billion time points. For the first time, we scaled a time series foundation model up to 2.4 billion parameters, achieving significantly improved forecasting precision. Our results validate the applicability of scaling laws for training tokens and model size in the context of time series forecasting. Compared to dense models with the same number of activated parameters or equivalent computation budgets, our models consistently outperform them by a large margin. These advancements position Time-MoE as a state-of-the-art solution for tackling real-world time series forecasting challenges with superior capability, efficiency, and flexibility.
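To make the sparse mixture-of-experts mechanism described in the abstract more concrete, the sketch below shows a top-k routed feed-forward layer of the kind commonly used inside decoder-only transformer blocks: a router scores all experts for each token, and only the top-k experts are actually executed. The class name `SparseMoEFeedForward`, the layer sizes, the expert count, and `top_k=2` are illustrative assumptions, not details taken from the Time-MoE paper or its codebase.

```python
# Minimal sketch of a sparse top-k mixture-of-experts feed-forward layer.
# Sizes and names are illustrative assumptions, not the Time-MoE implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoEFeedForward(nn.Module):
    def __init__(self, d_model: int = 256, d_hidden: int = 1024,
                 num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Gating network that scores every expert for each token.
        self.router = nn.Linear(d_model, num_experts)
        # Pool of independent feed-forward experts.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) -- flatten tokens before routing.
        batch, seq_len, d_model = x.shape
        tokens = x.reshape(-1, d_model)

        # Pick the top-k experts per token and renormalize their weights.
        logits = self.router(tokens)                       # (tokens, experts)
        weights, indices = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)

        out = torch.zeros_like(tokens)
        # Naive loop for clarity; real implementations batch tokens per expert.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e
                if mask.any():
                    # Only the selected expert runs on these tokens,
                    # which is what keeps the activated compute sparse.
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(tokens[mask])
        return out.reshape(batch, seq_len, d_model)


if __name__ == "__main__":
    layer = SparseMoEFeedForward()
    series_embeddings = torch.randn(4, 32, 256)  # (batch, context length, d_model)
    print(layer(series_embeddings).shape)        # torch.Size([4, 32, 256])
```

Because only `top_k` experts run per token, the number of activated parameters (and hence the inference cost) stays roughly constant as more experts are added, which is the scaling property the abstract emphasizes when comparing against dense models of equal activated size.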
