Insight Miner: A Time Series Analysis Dataset for Cross-Domain Alignment with Natural Language
December 12, 2025
Authors: Yunkai Zhang, Yawen Zhang, Ming Zheng, Kezhen Chen, Chongyang Gao, Ruian Ge, Siyuan Teng, Amine Jelloul, Jinmeng Rao, Xiaoyuan Guo, Chiang-Wei Fang, Zeyu Zheng, Jie Yang
cs.AI
Abstract
Time-series data is critical across many scientific and industrial domains, including environmental analysis, agriculture, transportation, and finance. However, mining insights from this data typically requires deep domain expertise, a process that is both time-consuming and labor-intensive. In this paper, we propose Insight Miner, a large-scale multimodal model (LMM) designed to generate high-quality, comprehensive time-series descriptions enriched with domain-specific knowledge. To facilitate this, we introduce TS-Insights (available at https://huggingface.co/datasets/zhykoties/time-series-language-alignment), the first general-domain dataset for time series and language alignment. TS-Insights contains 100k time-series windows sampled from 20 forecasting datasets. We construct this dataset using a novel agentic workflow: statistical tools first extract features from the raw time series, and GPT-4 then synthesizes those features into coherent trend descriptions. After instruction tuning on TS-Insights, Insight Miner outperforms state-of-the-art multimodal models, such as LLaVA (Liu et al., 2023) and GPT-4, in generating time-series descriptions and insights. Our findings suggest a promising direction for leveraging LMMs in time series analysis, and serve as a foundational step toward enabling LLMs to interpret time series as a native input modality.
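The two-stage workflow described in the abstract (statistical feature extraction followed by language-model synthesis) can be illustrated with a minimal sketch. The abstract does not name the statistical tools or the prompt format, so everything below is an assumption made for illustration: the feature set (least-squares slope, residual spread, range), the function names `extract_features` and `build_prompt`, and the prompt wording are all hypothetical. The sketch stops at building the prompt and does not call GPT-4.

```python
from statistics import mean, pstdev


def extract_features(window):
    """Summarize one time-series window with simple statistics.

    Hypothetical feature set: the abstract does not list the statistical
    tools actually used, so these are illustrative choices only.
    """
    n = len(window)
    if n < 2:
        raise ValueError("window needs at least two points")
    xs = range(n)
    x_bar, y_bar = mean(xs), mean(window)
    # Ordinary least-squares slope of value against time index.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, window))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Spread of the residuals around the fitted line.
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, window)]
    return {
        "length": n,
        "min": min(window),
        "max": max(window),
        "mean": y_bar,
        "slope": slope,
        "residual_std": pstdev(residuals),
    }


def build_prompt(features, domain="unknown"):
    """Format the features as a text prompt for a language model.

    The wording is an assumption, not the prompt used by the authors.
    """
    if features["slope"] > 0:
        direction = "upward"
    elif features["slope"] < 0:
        direction = "downward"
    else:
        direction = "flat"
    return (
        f"Domain: {domain}\n"
        f"Window length: {features['length']}\n"
        f"Range: {features['min']:.3f} to {features['max']:.3f}\n"
        f"Mean: {features['mean']:.3f}\n"
        f"Linear trend: {direction} (slope {features['slope']:.3f} per step)\n"
        f"Residual std: {features['residual_std']:.3f}\n"
        "Describe the trend of this time series in one coherent paragraph."
    )


if __name__ == "__main__":
    window = [1.0, 2.0, 3.0, 4.0]
    print(build_prompt(extract_features(window), domain="demo"))
```

Computing the statistics outside the language model keeps the numbers in the description grounded in the data, while the model is only asked to turn them into fluent text. This is one plausible reading of why the workflow separates the two stages.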