MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records
August 27, 2023
Authors: Scott L. Fleming, Alejandro Lozano, William J. Haberkorn, Jenelle A. Jindal, Eduardo P. Reis, Rahul Thapa, Louis Blankemeier, Julian Z. Genkins, Ethan Steinberg, Ashwin Nayak, Birju S. Patel, Chia-Chun Chiang, Alison Callahan, Zepeng Huo, Sergios Gatidis, Scott J. Adams, Oluseyi Fayanju, Shreya J. Shah, Thomas Savage, Ethan Goh, Akshay S. Chaudhari, Nima Aghaeepour, Christopher Sharp, Michael A. Pfeffer, Percy Liang, Jonathan H. Chen, Keith E. Morse, Emma P. Brunskill, Jason A. Fries, Nigam H. Shah
cs.AI
Abstract
The ability of large language models (LLMs) to follow natural language
instructions with human-level fluency suggests many opportunities in healthcare
to reduce administrative burden and improve quality of care. However,
evaluating LLMs on realistic text generation tasks for healthcare remains
challenging. Existing question answering datasets for electronic health record
(EHR) data fail to capture the complexity of information needs and
documentation burdens experienced by clinicians. To address these challenges,
we introduce MedAlign, a benchmark dataset of 983 natural language instructions
for EHR data. MedAlign is curated by 15 clinicians (7 specialities), includes
clinician-written reference responses for 303 instructions, and provides 276
longitudinal EHRs for grounding instruction-response pairs. We used MedAlign to
evaluate 6 general domain LLMs, having clinicians rank the accuracy and quality
of each LLM response. We found high error rates, ranging from 35% (GPT-4) to
68% (MPT-7B-Instruct), and an 8.3% drop in accuracy moving from 32k to 2k
context lengths for GPT-4. Finally, we report correlations between clinician
rankings and automated natural language generation metrics as a way to rank
LLMs without human review. We make MedAlign available under a research data use
agreement to enable LLM evaluations on tasks aligned with clinician needs and
preferences.
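The abstract's final point, that correlations between clinician rankings and automated natural language generation metrics could let researchers rank LLMs without human review, amounts to computing a rank correlation between the two orderings. The sketch below illustrates this with Kendall's tau; the model names, scores, and ranks are illustrative placeholders, not values from the paper.

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall rank correlation between two rankings of the same items.

    Returns +1 when the rankings agree exactly, -1 when they are
    reversed, and values in between for partial agreement.
    """
    concordant = discordant = 0
    for i, j in combinations(range(len(rank_a)), 2):
        a = rank_a[i] - rank_a[j]
        b = rank_b[i] - rank_b[j]
        if a * b > 0:
            concordant += 1
        elif a * b < 0:
            discordant += 1
    n_pairs = len(rank_a) * (len(rank_a) - 1) // 2
    return (concordant - discordant) / n_pairs

# Illustrative numbers only (not from the paper): four hypothetical LLMs.
clinician_rank = [1, 2, 3, 4]            # human ordering, 1 = best
metric_score = [0.71, 0.64, 0.52, 0.55]  # automated metric, higher = better

# Convert metric scores to ranks (highest score -> rank 1).
order = sorted(range(len(metric_score)), key=lambda i: -metric_score[i])
metric_rank = [0] * len(metric_score)
for r, i in enumerate(order, start=1):
    metric_rank[i] = r

print(kendall_tau(clinician_rank, metric_rank))  # 0.666...: strong agreement
```

A metric whose ranking correlates strongly with clinician judgments can serve as a cheap proxy during model development, with human review reserved for final validation.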