TSRBench: A Comprehensive Multi-task Multi-modal Time Series Reasoning Benchmark for Generalist Models

January 26, 2026
Authors: Fangxu Yu, Xingang Guo, Lingzhi Yuan, Haoqiang Kang, Hongyu Zhao, Lianhui Qin, Furong Huang, Bin Hu, Tianyi Zhou
cs.AI

Abstract

Time series data is ubiquitous in real-world scenarios and crucial for critical applications ranging from energy management to traffic control. Consequently, the ability to reason over time series is a fundamental skill that generalist models need to solve practical problems. However, this dimension is notably absent from existing benchmarks of generalist models. To bridge this gap, we introduce TSRBench, a comprehensive multi-modal benchmark designed to stress-test the full spectrum of time series reasoning capabilities. TSRBench features: i) a diverse set of 4125 problems from 14 domains, categorized into 4 major dimensions: Perception, Reasoning, Prediction, and Decision-Making; and ii) 15 tasks across the 4 dimensions that evaluate essential reasoning capabilities (e.g., numerical reasoning). Through extensive experiments, we evaluate over 30 leading proprietary and open-source LLMs, VLMs, and TSLLMs on TSRBench. Our findings reveal that: i) scaling laws hold for perception and reasoning but break down for prediction; ii) strong reasoning does not guarantee accurate context-aware forecasting, indicating a decoupling between semantic understanding and numerical prediction; and iii) despite the complementary nature of textual and visual representations of time series as inputs, current multimodal models fail to effectively fuse them for reciprocal performance gains. TSRBench provides a standardized evaluation platform that not only highlights existing challenges but also offers valuable insights to advance generalist models. Our code and dataset are available at https://tsrbench.github.io/.