TRIAGE: Dialectical Reasoning with Large Language Models for Interpretable Risk Prediction on Irregularly Sampled Medical Time Series

Abstract

Clinical early warning systems built on electronic health records, in which clinical observations are recorded as irregularly sampled medical time series (ISMTS), must deliver both calibrated risk scores for patient triage and interpretable rationales that clinicians can verify. Large language models (LLMs) have been explored for this task, yet they collapse graded clinical risk into overconfident binary predictions. This risk polarization undermines both calibration and cross-patient comparability. To address this, we propose TRIAGE, a framework that trains an LLM to generate dialectical reasoning over competing clinical outcomes by eliciting outcome-specific rationales. This dialectical formulation mitigates risk polarization, enabling a single LLM to yield continuous risk scores grounded in explicit clinical reasoning. Evaluated on three ISMTS benchmarks, TRIAGE improves AUPRC by 3.3% on average and reduces calibration error by 81% compared with competitive baselines. An LLM-as-a-judge assessment further shows that our rationales surpass the baseline's post-hoc explanations by 20% in clinical reasoning quality. The source code is available at https://github.com/HyeongWon-Jang/TRIAGE.
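To make the effect of risk polarization on calibration concrete, the following minimal sketch compares graded and polarized scores on a toy cohort. It rests on assumptions that are not taken from the paper: the abstract does not name its calibration metric, so the sketch uses expected calibration error (ECE) with equal-width bins as a common choice, and the function name `expected_calibration_error`, the ten-patient cohort, and all score values are hypothetical illustrations rather than TRIAGE outputs.

```python
import numpy as np


def expected_calibration_error(y_true, y_prob, n_bins=10):
    """ECE with equal-width bins (assumed metric; the paper's may differ)."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Map each score to a bin index in [0, n_bins - 1] using the interior edges.
    bin_ids = np.digitize(y_prob, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            # Gap between observed event rate and mean predicted risk in the bin,
            # weighted by the fraction of patients that fall in the bin.
            gap = abs(y_true[mask].mean() - y_prob[mask].mean())
            ece += mask.mean() * gap
    return float(ece)


# Synthetic cohort (hypothetical): 5 higher-risk patients with 3 events,
# followed by 5 lower-risk patients with 1 event.
y_true = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]

# Graded scores that match the observed event rate of each group.
graded = [0.6] * 5 + [0.2] * 5

# Polarized scores: same ranking of the two groups, pushed toward 0 and 1.
polarized = [0.99] * 5 + [0.01] * 5

ece_graded = expected_calibration_error(y_true, graded)
ece_polarized = expected_calibration_error(y_true, polarized)
print(f"ECE graded:    {ece_graded:.3f}")
print(f"ECE polarized: {ece_polarized:.3f}")
```

Both score vectors order the two groups identically, so a ranking metric cannot tell them apart on this toy cohort; only the calibration gap differs. That is the sense in which overconfident binary predictions hurt calibration and cross-patient comparability even when discrimination is unchanged.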