EHR-R1: A Reasoning-Enhanced Foundational Language Model for Electronic Health Record Analysis
October 29, 2025
Authors: Yusheng Liao, Chaoyi Wu, Junwei Liu, Shuyang Jiang, Pengcheng Qiu, Haowen Wang, Yun Yue, Shuai Zhen, Jian Wang, Qianrui Fan, Jinjie Gu, Ya Zhang, Yanfeng Wang, Yu Wang, Weidi Xie
cs.AI
Abstract
Electronic Health Records (EHRs) contain rich yet complex information, and
their automated analysis is critical for clinical decision-making. Despite
recent advances of large language models (LLMs) in clinical workflows, their
ability to analyze EHRs remains limited due to narrow task coverage and lack of
EHR-oriented reasoning capabilities. This paper aims to bridge this gap.
Specifically, we present EHR-Ins, a large-scale, comprehensive EHR reasoning
instruction dataset comprising 300k high-quality reasoning cases and 4M
non-reasoning cases across 42 distinct EHR tasks. Its core innovation is a
thinking-graph-driven framework that enables the generation of high-quality
reasoning data at scale. Building on this dataset, we develop EHR-R1, a series
of reasoning-enhanced
LLMs with up to 72B parameters tailored for EHR analysis. Through a multi-stage
training paradigm, including domain adaptation, reasoning enhancement, and
reinforcement learning, EHR-R1 systematically acquires domain knowledge and
diverse reasoning capabilities, enabling accurate and robust EHR analysis.
Lastly, we introduce EHR-Bench, a new benchmark curated from MIMIC-IV, spanning
42 tasks, to comprehensively assess reasoning and prediction across EHR
scenarios. In experiments, we show that the resulting EHR-R1 consistently
outperforms state-of-the-art commercial and open-source LLMs (including
DeepSeek-V3 and GPT-4o), surpassing GPT-4o by over 30 points on EHR-Bench and
achieving a 10% higher zero-shot AUROC on EHRSHOT. Collectively, EHR-Ins,
EHR-R1, and EHR-Bench significantly advance the development of more reliable
and clinically relevant EHR analysis.