EHR-R1: A Reasoning-Enhanced Foundational Language Model for Electronic Health Record Analysis
October 29, 2025
作者: Yusheng Liao, Chaoyi Wu, Junwei Liu, Shuyang Jiang, Pengcheng Qiu, Haowen Wang, Yun Yue, Shuai Zhen, Jian Wang, Qianrui Fan, Jinjie Gu, Ya Zhang, Yanfeng Wang, Yu Wang, Weidi Xie
cs.AI
Abstract
Electronic Health Records (EHRs) contain rich yet complex information, and
their automated analysis is critical for clinical decision-making. Despite
recent advances of large language models (LLMs) in clinical workflows, their
ability to analyze EHRs remains limited due to narrow task coverage and lack of
EHR-oriented reasoning capabilities. This paper aims to bridge this gap.
Specifically, we present EHR-Ins, a large-scale, comprehensive EHR reasoning
instruction dataset comprising 300k high-quality reasoning cases and 4M
non-reasoning cases across 42 distinct EHR tasks. Its core innovation is a
thinking-graph-driven framework that enables the generation of high-quality
reasoning data at scale. Building on this dataset, we develop EHR-R1, a series
of reasoning-enhanced LLMs with up to 72B parameters tailored for EHR analysis.
LLMs with up to 72B parameters tailored for EHR analysis. Through a multi-stage
training paradigm, including domain adaptation, reasoning enhancement, and
reinforcement learning, EHR-R1 systematically acquires domain knowledge and
diverse reasoning capabilities, enabling accurate and robust EHR analysis.
Lastly, we introduce EHR-Bench, a new benchmark curated from MIMIC-IV, spanning
42 tasks, to comprehensively assess reasoning and prediction across EHR
scenarios. In experiments, we show that the resulting EHR-R1 consistently
outperforms state-of-the-art commercial and open-source LLMs (including
DeepSeek-V3 and GPT-4o), surpassing GPT-4o by over 30 points on EHR-Bench and
achieving a 10% higher zero-shot AUROC on EHRSHOT. Collectively, EHR-Ins,
EHR-R1, and EHR-Bench advance the development of more reliable and clinically
relevant EHR analysis.