Clinical Text Summarization: Adapting Large Language Models Can Outperform Human Experts
September 14, 2023
作者: Dave Van Veen, Cara Van Uden, Louis Blankemeier, Jean-Benoit Delbrouck, Asad Aali, Christian Bluethgen, Anuj Pareek, Malgorzata Polacin, William Collins, Neera Ahuja, Curtis P. Langlotz, Jason Hom, Sergios Gatidis, John Pauly, Akshay S. Chaudhari
cs.AI
Abstract
Sifting through vast textual data and summarizing key information imposes a
substantial burden on how clinicians allocate their time. Although large
language models (LLMs) have shown immense promise in natural language
processing (NLP) tasks, their efficacy across diverse clinical summarization
tasks has not yet been rigorously examined. In this work, we employ domain
adaptation methods on eight LLMs, spanning six datasets and four distinct
summarization tasks: radiology reports, patient questions, progress notes, and
doctor-patient dialogue. Our thorough quantitative assessment reveals
trade-offs between models and adaptation methods in addition to instances where
recent advances in LLMs may not lead to improved results. Further, in a
clinical reader study with six physicians, we show that summaries from the
best-adapted LLM are preferable to human summaries in terms of completeness and
correctness. Our ensuing qualitative analysis delineates mutual challenges
faced by both LLMs and human experts. Lastly, we correlate traditional
quantitative NLP metrics with reader study scores to enhance our understanding
of how these metrics align with physician preferences. Our research marks the
first evidence of LLMs outperforming human experts in clinical text
summarization across multiple tasks. This implies that integrating LLMs into
clinical workflows could alleviate documentation burden, empowering clinicians
to focus more on personalized patient care and other irreplaceable human
aspects of medicine.