
Generalist Foundation Models Are Not Clinical Enough for Hospital Operations

November 17, 2025
Authors: Lavender Y. Jiang, Angelica Chen, Xu Han, Xujin Chris Liu, Radhika Dua, Kevin Eaton, Frederick Wolff, Robert Steele, Jeff Zhang, Anton Alyakin, Qingkai Pan, Yanbing Chen, Karl L. Sangwon, Daniel A. Alber, Jaden Stryker, Jin Vivian Lee, Yindalon Aphinyanaphongs, Kyunghyun Cho, Eric Karl Oermann
cs.AI

Abstract

Hospitals and healthcare systems rely on operational decisions that determine patient flow, cost, and quality of care. Despite strong performance on medical knowledge and conversational benchmarks, foundation models trained on general text may lack the specialized knowledge required for these operational decisions. We introduce Lang1, a family of models (100M-7B parameters) pretrained on a specialized corpus blending 80B clinical tokens from NYU Langone Health's EHRs with 627B tokens from the internet. To rigorously evaluate Lang1 in real-world settings, we developed the REalistic Medical Evaluation (ReMedE) benchmark, derived from 668,331 EHR notes, which evaluates five critical tasks: 30-day readmission prediction, 30-day mortality prediction, length of stay, comorbidity coding, and insurance claim denial prediction. In zero-shot settings, both general-purpose and specialized models underperform on four of the five tasks (36.6%-71.7% AUROC), with mortality prediction being the exception. After finetuning, Lang1-1B outperforms finetuned generalist models up to 70x larger and zero-shot models up to 671x larger, improving AUROC by 3.64%-6.75% and 1.66%-23.66%, respectively. We also observed cross-task scaling: joint finetuning on multiple tasks improved performance on other tasks. Lang1-1B transfers effectively to out-of-distribution settings, including other clinical tasks and an external health system. Our findings suggest that predictive capability for hospital operations requires explicit supervised finetuning, and that this finetuning is made more efficient by in-domain pretraining on EHRs. These findings support the emerging view that specialized LLMs can compete with generalist models on specialized tasks, and show that effective healthcare system AI requires a combination of in-domain pretraining, supervised finetuning, and real-world evaluation beyond proxy benchmarks.
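
The abstract frames the five ReMedE tasks as note-level binary predictions scored with AUROC. As a minimal, hedged sketch of how such a per-task evaluation might be tallied (this is illustrative only, not the authors' pipeline; the CSV layout, column names, and task keys below are assumptions):

```python
# Illustrative sketch only: score ReMedE-style binary operational predictions
# with per-task AUROC. The file name, column naming scheme, and task keys are
# hypothetical assumptions, not the paper's actual data format.
import pandas as pd
from sklearn.metrics import roc_auc_score

TASKS = [
    "readmission_30d",      # 30-day readmission
    "mortality_30d",        # 30-day mortality
    "long_length_of_stay",  # length of stay, binarized
    "comorbidity_code",     # comorbidity coding (binary per-code labels)
    "claim_denial",         # insurance claim denial
]

def evaluate(predictions_csv: str) -> dict[str, float]:
    """Return AUROC per task from a table of model scores and labels.

    Assumes columns '<task>_label' (0/1) and '<task>_score' (probability)
    for each task, with one row per EHR note or encounter.
    """
    df = pd.read_csv(predictions_csv)
    results = {}
    for task in TASKS:
        results[task] = roc_auc_score(df[f"{task}_label"], df[f"{task}_score"])
    return results

if __name__ == "__main__":
    for task, auroc in evaluate("remede_predictions.csv").items():
        print(f"{task}: AUROC = {auroc:.3f}")
```

Under this framing, the paper's zero-shot versus finetuned comparison amounts to running the same scoring over two sets of model outputs and comparing the per-task AUROC values.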