OpenTSLM: Time-Series Language Models for Reasoning over Multivariate Medical Text- and Time-Series Data
October 2, 2025
Authors: Patrick Langer, Thomas Kaar, Max Rosenblattl, Maxwell A. Xu, Winnie Chow, Martin Maritsch, Aradhana Verma, Brian Han, Daniel Seung Kim, Henry Chubb, Scott Ceresnak, Aydin Zahedivash, Alexander Tarlochan Singh Sandhu, Fatima Rodriguez, Daniel McDuff, Elgar Fleisch, Oliver Aalami, Filipe Barata, Paul Schmiedmayer
cs.AI
Abstract
LLMs have emerged as powerful tools for interpreting multimodal data. In
medicine, they hold particular promise for synthesizing large volumes of
clinical information into actionable insights and digital health applications.
Yet, a major limitation remains their inability to handle time series. To
overcome this gap, we present OpenTSLM, a family of Time Series Language Models
(TSLMs) created by integrating time series as a native modality into pretrained
LLMs, enabling reasoning over multiple time series of any length. We
investigate two architectures for OpenTSLM. The first, OpenTSLM-SoftPrompt,
models time series implicitly by concatenating learnable time series tokens
with text tokens via soft prompting. Although this approach is parameter-efficient, we
hypothesize that explicit time series modeling scales better and outperforms
implicit approaches. We thus introduce OpenTSLM-Flamingo, which integrates time
series with text via cross-attention. We benchmark both variants against
baselines that treat time series as text tokens or plots, across a suite of
text-time-series Chain-of-Thought (CoT) reasoning tasks. We introduce three
datasets: HAR-CoT, Sleep-CoT, and ECG-QA-CoT. Across all, OpenTSLM models
outperform baselines, reaching 69.9 F1 in sleep staging and 65.4 in HAR,
compared to 9.05 and 52.2 for finetuned text-only models. Notably, even
1B-parameter OpenTSLM models surpass GPT-4o (15.47 and 2.95). OpenTSLM-Flamingo
matches OpenTSLM-SoftPrompt in performance and outperforms it on longer sequences,
while maintaining stable memory requirements. By contrast, SoftPrompt's memory
footprint grows exponentially with sequence length, requiring around 110 GB of
VRAM when training on ECG-QA with LLaMA-3B, versus 40 GB for Flamingo. Expert
reviews by clinicians find that OpenTSLM models exhibit strong reasoning
capabilities on ECG-QA. To facilitate further research, we open-source all
code, datasets, and models.
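
The two fusion strategies summarized above can be contrasted in a minimal PyTorch-style sketch. This is not the released OpenTSLM implementation: the module names, patch length, projection layers, and hidden size are illustrative assumptions, intended only to show why prompt concatenation (implicit modeling) inflates the sequence fed to the LLM as the time series grows, while cross-attention (explicit modeling) leaves the text length unchanged.

import torch
import torch.nn as nn

class SoftPromptFusion(nn.Module):
    # Implicit modeling in the spirit of OpenTSLM-SoftPrompt (illustrative only):
    # project patches of the raw series into the embedding space and prepend them
    # to the text token embeddings, so prompt length grows with series length.
    def __init__(self, d_model: int, patch_len: int = 16):
        super().__init__()
        self.patch_len = patch_len
        self.patch_proj = nn.Linear(patch_len, d_model)

    def forward(self, ts: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # ts: (batch, ts_len), ts_len assumed divisible by patch_len
        # text_emb: (batch, txt_len, d_model)
        b, t = ts.shape
        patches = ts.view(b, t // self.patch_len, self.patch_len)
        ts_tokens = self.patch_proj(patches)            # (batch, n_patches, d_model)
        return torch.cat([ts_tokens, text_emb], dim=1)  # longer series -> longer prompt

class CrossAttentionFusion(nn.Module):
    # Explicit modeling in the spirit of OpenTSLM-Flamingo (illustrative only):
    # text tokens attend to time series features via cross-attention, so the
    # text sequence length stays fixed regardless of series length.
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.ts_proj = nn.Linear(1, d_model)
        self.xattn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, ts: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # ts: (batch, ts_len, 1), text_emb: (batch, txt_len, d_model)
        ts_feat = self.ts_proj(ts)                      # (batch, ts_len, d_model)
        attended, _ = self.xattn(query=text_emb, key=ts_feat, value=ts_feat)
        return self.norm(text_emb + attended)           # residual fusion into the text stream

Either module's output can replace the plain text embeddings fed to a pretrained decoder; the sketch is only meant to make concrete why the soft-prompt route increases the sequence length (and hence memory) with longer time series, whereas cross-attention keeps it constant.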