MEDIC: Towards a Comprehensive Framework for Evaluating LLMs in Clinical Applications

September 11, 2024
作者: Praveen K Kanithi, Clément Christophe, Marco AF Pimentel, Tathagata Raha, Nada Saadi, Hamza Javed, Svetlana Maslenkova, Nasir Hayat, Ronnie Rajan, Shadab Khan
cs.AI

Abstract

The rapid development of Large Language Models (LLMs) for healthcare applications has spurred calls for holistic evaluation beyond frequently cited benchmarks like USMLE, to better reflect real-world performance. While real-world assessments are valuable indicators of utility, they often lag behind the pace of LLM evolution, likely rendering findings obsolete upon deployment. This temporal disconnect necessitates a comprehensive upfront evaluation that can guide model selection for specific clinical applications. We introduce MEDIC, a framework assessing LLMs across five critical dimensions of clinical competence: medical reasoning, ethics and bias, data and language understanding, in-context learning, and clinical safety. MEDIC features a novel cross-examination framework quantifying LLM performance across areas like coverage and hallucination detection, without requiring reference outputs. We apply MEDIC to evaluate LLMs on medical question-answering, safety, summarization, note generation, and other tasks. Our results show performance disparities across model sizes and between baseline and medically fine-tuned models, and have implications for model selection in applications requiring specific model strengths, such as low hallucination or lower cost of inference. MEDIC's multifaceted evaluation reveals these performance trade-offs, bridging the gap between theoretical capabilities and practical implementation in healthcare settings, ensuring that the most promising models are identified and adapted for diverse healthcare applications.
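The abstract describes reference-free cross-examination only at a high level: questions are derived from the source text and from the model's output, and each set is answered against the other, yielding a coverage score (source-derived questions answerable from the output) and an unsupported-content score (output-derived questions unanswerable from the source). The sketch below illustrates that control flow only; the paper uses LLM-based question generation and answering, which is replaced here by a hypothetical keyword heuristic (`toy_questions`, `answerable` are illustrative stand-ins, not part of MEDIC).

```python
# Minimal sketch of reference-free cross-examination (control flow only).
# ASSUMPTION: toy_questions/answerable stand in for the LLM-based question
# generation and answering the paper actually uses.

def toy_questions(text: str) -> list[str]:
    # Stand-in "question generator": treat each long word as a probe.
    return [w.strip(".,").lower() for w in text.split() if len(w) > 6]

def answerable(question: str, text: str) -> bool:
    # Stand-in "answerer": a probe is answerable if the text mentions it.
    return question in text.lower()

def cross_examine(source: str, summary: str) -> dict:
    src_qs = toy_questions(source)    # probes the summary should cover
    sum_qs = toy_questions(summary)   # probes the source must support
    coverage = sum(answerable(q, summary) for q in src_qs) / max(len(src_qs), 1)
    unsupported = sum(not answerable(q, source) for q in sum_qs) / max(len(sum_qs), 1)
    return {"coverage": coverage, "hallucination_rate": unsupported}
```

Note that neither score needs a gold reference summary: both are computed purely by comparing the model's output against the source document, which is what allows this style of evaluation to scale across tasks like summarization and note generation.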
