OpenTSLM: Time-Series Language Models for Reasoning over Multivariate Medical Text- and Time-Series Data

Abstract

LLMs have emerged as powerful tools for interpreting multimodal data. In medicine, they hold particular promise for synthesizing large volumes of clinical information into actionable insights and digital health applications. Yet a major limitation remains their inability to handle time series. To overcome this gap, we present OpenTSLM, a family of Time Series Language Models (TSLMs) created by integrating time series as a native modality into pretrained LLMs, enabling reasoning over multiple time series of any length. We investigate two architectures for OpenTSLM. The first, OpenTSLM-SoftPrompt, models time series implicitly by concatenating learnable time series tokens with text tokens via soft prompting. Although this approach is parameter-efficient, we hypothesize that explicit time series modeling scales better and outperforms implicit approaches. We thus introduce OpenTSLM-Flamingo, which integrates time series with text via cross-attention. We benchmark both variants against baselines that treat time series as text tokens or plots, across a suite of text-time-series Chain-of-Thought (CoT) reasoning tasks. We introduce three datasets: HAR-CoT, Sleep-CoT, and ECG-QA-CoT. Across all three, OpenTSLM models outperform the baselines, reaching 69.9 F1 in sleep staging and 65.4 in HAR, compared to 9.05 and 52.2, respectively, for finetuned text-only models. Notably, even 1B-parameter OpenTSLM models surpass GPT-4o (15.47 and 2.95). OpenTSLM-Flamingo matches OpenTSLM-SoftPrompt in performance and outperforms it on longer sequences, while maintaining stable memory requirements. By contrast, SoftPrompt's memory use grows exponentially with sequence length: training on ECG-QA with LLaMA-3B requires around 110 GB of VRAM with SoftPrompt, compared to 40 GB with Flamingo. Expert reviews by clinicians find that OpenTSLMs exhibit strong reasoning capabilities on ECG-QA. To facilitate further research, we release all code, datasets, and models as open source.
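The difference between the two integration strategies named in the abstract can be illustrated with a minimal, dependency-free Python sketch. It is not the OpenTSLM implementation: the patch-based encoder, the identity query/key/value projections, and all names (`patch_embed`, `soft_prompt_inputs`, `cross_attend`, `patch_size`, `d_model`) are assumptions made for illustration. It only shows that concatenation lengthens the sequence the LLM processes, while cross-attention keeps it at the text length.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def patch_embed(series, patch_size, proj):
    """Hypothetical encoder: split a 1-D series into non-overlapping
    patches and linearly project each patch to d_model dimensions."""
    stop = len(series) - patch_size + 1
    patches = [series[i:i + patch_size] for i in range(0, stop, patch_size)]
    return matmul(patches, proj)


def soft_prompt_inputs(text_emb, ts_emb):
    """Soft prompting: time series tokens are concatenated with the text
    tokens, so the LLM input grows with the length of the series."""
    return ts_emb + text_emb


def cross_attend(text_emb, ts_emb):
    """Cross-attention: text tokens act as queries, time series tokens as
    keys and values. Learned projections and gating are omitted here
    (an assumption for brevity). Output has one row per text token."""
    d = len(text_emb[0])
    out = []
    for query in text_emb:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
                  for key in ts_emb]
        weights = softmax(scores)
        attended = [sum(w * v[j] for w, v in zip(weights, ts_emb))
                    for j in range(d)]
        # Residual connection: add the attended summary to the text token.
        out.append([t + a for t, a in zip(query, attended)])
    return out


random.seed(0)
d_model, patch_size, n_text = 8, 4, 5
proj = [[random.gauss(0, 0.1) for _ in range(d_model)] for _ in range(patch_size)]
text_emb = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(n_text)]

for length in (16, 64):
    series = [math.sin(0.1 * t) for t in range(length)]
    ts_emb = patch_embed(series, patch_size, proj)
    n_soft = len(soft_prompt_inputs(text_emb, ts_emb))
    n_cross = len(cross_attend(text_emb, ts_emb))
    print(f"series length {length}: soft-prompt tokens={n_soft}, "
          f"cross-attention tokens={n_cross}")
```

With these toy settings, the soft-prompt input grows from 9 to 21 tokens as the series grows from 16 to 64 samples, while the cross-attention output stays at 5 tokens. The sketch says nothing about actual VRAM use, which depends on the model, the encoder, and the training setup.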
