Generalist Large Language Models Outperform Clinical Tools on Medical Benchmarks
December 1, 2025
Authors: Krithik Vishwanath, Mrigayu Ghosh, Anton Alyakin, Daniel Alexander Alber, Yindalon Aphinyanaphongs, Eric Karl Oermann
cs.AI
Abstract
Specialized clinical AI assistants are rapidly entering medical practice, often framed as safer or more reliable than general-purpose large language models (LLMs). Yet, unlike frontier models, these clinical tools are rarely subjected to independent, quantitative evaluation, creating a critical evidence gap despite their growing influence on diagnosis, triage, and guideline interpretation. We assessed two widely deployed clinical AI systems (OpenEvidence and UpToDate Expert AI) against three state-of-the-art generalist LLMs (GPT-5, Gemini 3 Pro, and Claude Sonnet 4.5) using a 1,000-item mini-benchmark combining MedQA (medical knowledge) and HealthBench (clinician-alignment) tasks. Generalist models consistently outperformed clinical tools, with GPT-5 achieving the highest scores, while OpenEvidence and UpToDate demonstrated deficits in completeness, communication quality, context awareness, and systems-based safety reasoning. These findings reveal that tools marketed for clinical decision support may often lag behind frontier LLMs, underscoring the urgent need for transparent, independent evaluation before deployment in patient-facing workflows.
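To make the evaluation design concrete, below is a minimal sketch of how a mixed MedQA/HealthBench-style harness could score a system: exact-match accuracy on multiple-choice knowledge items, plus weighted rubric credit on open-ended responses. This is an illustrative assumption, not the authors' published harness: the `query_model` stub, the demo item, and the naive substring rubric check are all placeholders (HealthBench grading in practice uses a grader model against physician-written rubric criteria).

```python
# Illustrative sketch of a MedQA/HealthBench-style mini-benchmark harness.
# query_model(), the demo item, and the substring rubric grader are
# hypothetical placeholders; the paper does not publish its harness.
from dataclasses import dataclass


@dataclass
class MCQItem:
    question: str
    options: dict[str, str]  # e.g. {"A": "Pancreas", "B": "Liver"}
    answer: str              # gold option letter


def query_model(prompt: str) -> str:
    """Placeholder: swap in a real API call to the system under test."""
    return "A"


def score_mcq(items: list[MCQItem]) -> float:
    """MedQA-style knowledge score: fraction of items answered correctly."""
    correct = 0
    for item in items:
        choices = "\n".join(f"{k}. {v}" for k, v in sorted(item.options.items()))
        prompt = f"{item.question}\n{choices}\nAnswer with a single letter."
        reply = query_model(prompt).strip().upper()
        correct += reply[:1] == item.answer
    return correct / len(items)


def score_rubric(response: str, criteria: list[tuple[str, int]]) -> float:
    """HealthBench-style alignment score: weighted rubric criteria, here
    checked with naive substring matching instead of a grader model."""
    earned = sum(w for text, w in criteria if text.lower() in response.lower())
    return earned / sum(w for _, w in criteria)


if __name__ == "__main__":
    demo = [MCQItem("Which organ produces insulin?",
                    {"A": "Pancreas", "B": "Liver", "C": "Spleen"}, "A")]
    print(f"MCQ accuracy: {score_mcq(demo):.2f}")
    crit = [("pancreas", 2), ("beta cells", 1)]
    reply = "The pancreas (beta cells) makes insulin."
    print(f"Rubric score: {score_rubric(reply, crit):.2f}")
```

Under this framing, the paper's headline comparison reduces to running both scorers over the same 1,000 sampled items for each of the five systems and comparing the aggregate scores.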