Generalist Foundation Models Are Not Clinical Enough for Hospital Operations
November 17, 2025
Authors: Lavender Y. Jiang, Angelica Chen, Xu Han, Xujin Chris Liu, Radhika Dua, Kevin Eaton, Frederick Wolff, Robert Steele, Jeff Zhang, Anton Alyakin, Qingkai Pan, Yanbing Chen, Karl L. Sangwon, Daniel A. Alber, Jaden Stryker, Jin Vivian Lee, Yindalon Aphinyanaphongs, Kyunghyun Cho, Eric Karl Oermann
cs.AI
Abstract
Hospitals and healthcare systems rely on operational decisions that determine patient flow, cost, and quality of care. Despite strong performance on medical-knowledge and conversational benchmarks, foundation models trained on general text may lack the specialized knowledge required for these operational decisions. We introduce Lang1, a family of models (100M-7B parameters) pretrained on a specialized corpus blending 80B clinical tokens from NYU Langone Health's EHRs with 627B tokens from the internet. To rigorously evaluate Lang1 in real-world settings, we developed the REalistic Medical Evaluation (ReMedE) benchmark, derived from 668,331 EHR notes, which covers five critical tasks: 30-day readmission prediction, 30-day mortality prediction, length-of-stay prediction, comorbidity coding, and insurance claim denial prediction. In zero-shot settings, both general-purpose and specialized models underperform on four of the five tasks (36.6%-71.7% AUROC), with mortality prediction being the exception. After finetuning, Lang1-1B outperforms finetuned generalist models up to 70x larger and zero-shot models up to 671x larger, improving AUROC by 3.64%-6.75% and 1.66%-23.66%, respectively. We also observed cross-task scaling: joint finetuning on multiple tasks improved performance on other tasks. Lang1-1B transfers effectively to out-of-distribution settings, including other clinical tasks and an external health system. Our findings suggest that predictive capability for hospital operations requires explicit supervised finetuning, and that this finetuning is made more efficient by in-domain pretraining on EHRs. These results support the emerging view that specialized LLMs can compete with generalist models on specialized tasks, and show that effective healthcare-system AI requires combining in-domain pretraining, supervised finetuning, and real-world evaluation beyond proxy benchmarks.
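The AUROC figures above can be read as a ranking probability: the chance that a randomly chosen positive case (e.g. a patient readmitted within 30 days) receives a higher model score than a randomly chosen negative case. A minimal, stdlib-only sketch of that metric follows; the labels and scores are toy values for illustration, not the paper's code or ReMedE data.

```python
def auroc(labels, scores):
    """Area under the ROC curve, computed by pairwise ranking.

    Equals the fraction of (positive, negative) pairs in which the
    positive example receives the higher score; ties count as half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: 1 = readmitted within 30 days, scores = predicted probabilities.
labels = [0, 0, 1, 1, 0, 1]
scores = [0.10, 0.40, 0.35, 0.80, 0.20, 0.70]
print(f"AUROC: {auroc(labels, scores):.3f}")  # 8 of 9 pairs ranked correctly
```

An AUROC of 0.5 corresponds to random ranking, which is why zero-shot scores as low as 36.6% (worse than chance) signal that the models lack the operational signal these tasks require.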