MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records
August 27, 2023
Authors: Scott L. Fleming, Alejandro Lozano, William J. Haberkorn, Jenelle A. Jindal, Eduardo P. Reis, Rahul Thapa, Louis Blankemeier, Julian Z. Genkins, Ethan Steinberg, Ashwin Nayak, Birju S. Patel, Chia-Chun Chiang, Alison Callahan, Zepeng Huo, Sergios Gatidis, Scott J. Adams, Oluseyi Fayanju, Shreya J. Shah, Thomas Savage, Ethan Goh, Akshay S. Chaudhari, Nima Aghaeepour, Christopher Sharp, Michael A. Pfeffer, Percy Liang, Jonathan H. Chen, Keith E. Morse, Emma P. Brunskill, Jason A. Fries, Nigam H. Shah
cs.AI
Abstract
The ability of large language models (LLMs) to follow natural language
instructions with human-level fluency suggests many opportunities in healthcare
to reduce administrative burden and improve quality of care. However,
evaluating LLMs on realistic text generation tasks for healthcare remains
challenging. Existing question answering datasets for electronic health record
(EHR) data fail to capture the complexity of information needs and
documentation burdens experienced by clinicians. To address these challenges,
we introduce MedAlign, a benchmark dataset of 983 natural language instructions
for EHR data. MedAlign is curated by 15 clinicians (7 specialties), includes
clinician-written reference responses for 303 instructions, and provides 276
longitudinal EHRs for grounding instruction-response pairs. We used MedAlign to
evaluate 6 general-domain LLMs, having clinicians rank the accuracy and quality
of each LLM response. We found high error rates, ranging from 35% (GPT-4) to
68% (MPT-7B-Instruct), and an 8.3% drop in accuracy for GPT-4 when its context
length is reduced from 32k to 2k tokens. Finally, we report correlations between clinician
rankings and automated natural language generation metrics as a way to rank
LLMs without human review. We make MedAlign available under a research data use
agreement to enable LLM evaluations on tasks aligned with clinician needs and
preferences.
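
The abstract's final point, ranking LLMs by how well automated metrics agree with clinician judgments, can be made concrete with a rank-correlation computation. Below is a minimal illustrative sketch, not taken from the paper: it computes Kendall's tau between per-model clinician ranks and automated metric scores. All model names and numbers here are hypothetical placeholders.

```python
# Illustrative sketch (not from the paper): measuring how well an automated
# NLG metric reproduces clinician rankings of LLMs via Kendall's tau.
from scipy.stats import kendalltau

# Mean clinician rank per model (lower = better); hypothetical values.
clinician_rank = {"gpt-4-32k": 1.4, "gpt-4-2k": 2.1,
                  "model-a": 3.0, "mpt-7b-instruct": 4.2}

# Mean automated metric score per model (higher = better); hypothetical values.
metric_score = {"gpt-4-32k": 0.78, "gpt-4-2k": 0.71,
                "model-a": 0.60, "mpt-7b-instruct": 0.48}

models = list(clinician_rank)
# Negate metric scores so that both sequences order "better" the same way.
tau, p_value = kendalltau([clinician_rank[m] for m in models],
                          [-metric_score[m] for m in models])
print(f"Kendall tau = {tau:.2f} (p = {p_value:.3f})")
```

In a setting like the paper's, a tau close to 1 would suggest the automated metric can stand in for costly clinician review when ranking candidate models, while a low tau would indicate human evaluation remains necessary.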