Clinical knowledge in LLMs does not translate to human interactions
April 26, 2025
Authors: Andrew M. Bean, Rebecca Payne, Guy Parsons, Hannah Rose Kirk, Juan Ciro, Rafael Mosquera, Sara Hincapié Monsalve, Aruna S. Ekanayaka, Lionel Tarassenko, Luc Rocher, Adam Mahdi
cs.AI
Abstract
Global healthcare providers are exploring use of large language models (LLMs)
to provide medical advice to the public. LLMs now achieve nearly perfect scores
on medical licensing exams, but this does not necessarily translate to accurate
performance in real-world settings. We tested if LLMs can assist members of the
public in identifying underlying conditions and choosing a course of action
(disposition) in ten medical scenarios in a controlled study with 1,298
participants. Participants were randomly assigned to receive assistance from an
LLM (GPT-4o, Llama 3, Command R+) or a source of their choice (control). Tested
alone, LLMs complete the scenarios accurately, correctly identifying conditions
in 94.9% of cases and disposition in 56.3% on average. However, participants
using the same LLMs identified relevant conditions in less than 34.5% of cases
and disposition in less than 44.2%, both no better than the control group. We
identify user interactions as a challenge to the deployment of LLMs for medical
advice. Standard benchmarks for medical knowledge and simulated patient
interactions do not predict the failures we find with human participants.
Moving forward, we recommend systematic human user testing to evaluate
interactive capabilities prior to public deployments in healthcare.