OpenTSLM: Time-Series Language Models for Reasoning over Multivariate Medical Text- and Time-Series Data

Abstract

Large language models (LLMs) have emerged as powerful tools for interpreting multimodal data. In medicine, they hold particular promise for synthesizing large volumes of clinical information into actionable insights and digital health applications. A major limitation, however, is their inability to handle time series. To close this gap, we present OpenTSLM, a family of Time Series Language Models (TSLMs) created by integrating time series as a native modality into pretrained LLMs, enabling reasoning over multiple time series of any length. We investigate two architectures for OpenTSLM. The first, OpenTSLM-SoftPrompt, models time series implicitly by concatenating learnable time series tokens with text tokens via soft prompting. Although this approach is parameter-efficient, we hypothesize that explicit time series modeling scales better and outperforms implicit approaches. We therefore introduce OpenTSLM-Flamingo, which integrates time series with text via cross-attention. We benchmark both variants against baselines that treat time series as text tokens or plots, across a suite of text-time-series Chain-of-Thought (CoT) reasoning tasks. We introduce three datasets: HAR-CoT, Sleep-CoT, and ECG-QA-CoT. Across all three, OpenTSLM models outperform the baselines, reaching 69.9 F1 in sleep staging and 65.4 in HAR, compared to 9.05 and 52.2 for finetuned text-only models. Notably, even 1B-parameter OpenTSLM models surpass GPT-4o (15.47 and 2.95). OpenTSLM-Flamingo matches OpenTSLM-SoftPrompt in performance and outperforms it on longer sequences, while maintaining stable memory requirements. By contrast, the memory use of OpenTSLM-SoftPrompt grows exponentially with sequence length: training on ECG-QA with LLaMA-3B requires around 110 GB of VRAM, compared to 40 GB for OpenTSLM-Flamingo. In expert reviews, clinicians find that OpenTSLM models exhibit strong reasoning capabilities on ECG-QA. To facilitate further research, we release all code, datasets, and models as open source.
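The contrast between the two architectures can be illustrated with a toy count of attention-score entries. This is a minimal sketch under stated assumptions, not the OpenTSLM implementation: the function names, the `patch_size` parameter, and the idea that each series is compressed to a fixed number of latent vectors (`latents_per_series`, as in the original Flamingo design) are all hypothetical choices made for illustration. It only shows why concatenating time series tokens with text makes the attention cost depend on series length, while cross-attention over fixed-size latents does not; it does not attempt to reproduce the memory figures or growth rate reported above.

```python
from math import ceil
from typing import Sequence


def soft_prompt_context_length(text_tokens: int,
                               series_lengths: Sequence[int],
                               patch_size: int) -> int:
    """Tokens the LLM attends over when time series tokens are
    concatenated with text tokens (one token per patch, an assumption)."""
    ts_tokens = sum(ceil(n / patch_size) for n in series_lengths)
    return text_tokens + ts_tokens


def soft_prompt_score_entries(text_tokens: int,
                              series_lengths: Sequence[int],
                              patch_size: int) -> int:
    """Self-attention score entries per head and layer: context length squared."""
    context = soft_prompt_context_length(text_tokens, series_lengths, patch_size)
    return context * context


def cross_attention_score_entries(text_tokens: int,
                                  series_lengths: Sequence[int],
                                  latents_per_series: int) -> int:
    """Text self-attention entries plus text-to-series cross-attention entries.

    Assumes each series is compressed to a fixed number of latent vectors,
    so the count is independent of the raw series length.
    """
    latents = latents_per_series * len(series_lengths)
    return text_tokens * text_tokens + text_tokens * latents


if __name__ == "__main__":
    # Three series of equal length, 200 text tokens (illustrative numbers only).
    for n in (1_000, 10_000, 100_000):
        series = [n, n, n]
        print(n,
              soft_prompt_score_entries(200, series, patch_size=4),
              cross_attention_score_entries(200, series, latents_per_series=64))
```

In this toy model the concatenation count rises with every increase in series length, while the cross-attention count stays fixed for a given number of series, which is consistent in direction with the stable memory requirements described for OpenTSLM-Flamingo.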
