ChatPaper.ai

Generalist Large Language Models Outperform Clinical Tools on Medical Benchmarks

December 1, 2025
Authors: Krithik Vishwanath, Mrigayu Ghosh, Anton Alyakin, Daniel Alexander Alber, Yindalon Aphinyanaphongs, Eric Karl Oermann
cs.AI

Abstract

Specialized clinical AI assistants are rapidly entering medical practice, often framed as safer or more reliable than general-purpose large language models (LLMs). Yet, unlike frontier models, these clinical tools are rarely subjected to independent, quantitative evaluation, creating a critical evidence gap despite their growing influence on diagnosis, triage, and guideline interpretation. We assessed two widely deployed clinical AI systems (OpenEvidence and UpToDate Expert AI) against three state-of-the-art generalist LLMs (GPT-5, Gemini 3 Pro, and Claude Sonnet 4.5) using a 1,000-item mini-benchmark combining MedQA (medical knowledge) and HealthBench (clinician-alignment) tasks. Generalist models consistently outperformed clinical tools, with GPT-5 achieving the highest scores, while OpenEvidence and UpToDate demonstrated deficits in completeness, communication quality, context awareness, and systems-based safety reasoning. These findings reveal that tools marketed for clinical decision support may often lag behind frontier LLMs, underscoring the urgent need for transparent, independent evaluation before deployment in patient-facing workflows.
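The mixed scoring design described above can be sketched in a few lines: MedQA items are graded as exact-match multiple-choice accuracy, while HealthBench-style items are graded against a weighted rubric. The following is a minimal illustrative sketch, assuming simplified data shapes; the function names and toy data are hypothetical and not taken from the paper's evaluation code.

```python
# Hypothetical sketch of a mixed mini-benchmark scorer:
# MedQA-style multiple-choice accuracy plus a HealthBench-style
# weighted-rubric score. All names and data are illustrative assumptions.

def medqa_accuracy(predictions, answers):
    """Fraction of multiple-choice items answered correctly (exact match)."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

def rubric_score(criteria_met, criteria_weights):
    """Weighted fraction of rubric criteria satisfied by a free-text answer."""
    total = sum(criteria_weights)
    earned = sum(w for met, w in zip(criteria_met, criteria_weights) if met)
    return earned / total

# Toy example: four MedQA items and one rubric-graded response.
preds, golds = ["B", "C", "A", "D"], ["B", "C", "A", "A"]
acc = medqa_accuracy(preds, golds)                   # 3 of 4 correct -> 0.75
rub = rubric_score([True, False, True], [2, 1, 1])   # 3 of 4 weight -> 0.75
print(acc, rub)
```

In a real evaluation each free-text rubric judgment would itself come from a grader (human or model-based), and per-model scores would be aggregated across the full 1,000-item set.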
PDF · December 3, 2025