

LiveMedBench: A Contamination-Free Medical Benchmark for LLMs with Automated Rubric Evaluation

February 10, 2026
Authors: Zhiling Yan, Dingjie Song, Zhe Fang, Yisheng Ji, Xiang Li, Quanzheng Li, Lichao Sun
cs.AI

Abstract

The deployment of Large Language Models (LLMs) in high-stakes clinical settings demands rigorous and reliable evaluation. However, existing medical benchmarks remain static, suffering from two critical limitations: (1) data contamination, where test sets inadvertently leak into training corpora, leading to inflated performance estimates; and (2) temporal misalignment, failing to capture the rapid evolution of medical knowledge. Furthermore, current evaluation metrics for open-ended clinical reasoning often rely on either shallow lexical overlap (e.g., ROUGE) or subjective LLM-as-a-Judge scoring, both of which are inadequate for verifying clinical correctness. To bridge these gaps, we introduce LiveMedBench, a continuously updated, contamination-free, rubric-based benchmark that harvests real-world clinical cases from online medical communities every week, ensuring strict temporal separation from model training data. We propose a Multi-Agent Clinical Curation Framework that filters noise from the raw data and validates clinical integrity against evidence-based medical principles. For evaluation, we develop an Automated Rubric-based Evaluation Framework that decomposes physician responses into granular, case-specific criteria, achieving substantially stronger alignment with expert physicians than LLM-as-a-Judge scoring. To date, LiveMedBench comprises 2,756 real-world cases spanning 38 medical specialties and multiple languages, paired with 16,702 unique evaluation criteria. Extensive evaluation of 38 LLMs reveals that even the best-performing model achieves only 39.2% accuracy, and 84% of models exhibit performance degradation on post-cutoff cases, confirming pervasive data-contamination risks. Error analysis further identifies contextual application, rather than factual knowledge, as the dominant bottleneck, with 35-48% of failures stemming from an inability to tailor medical knowledge to patient-specific constraints.
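
The abstract describes decomposing each reference answer into granular, case-specific criteria and grading a model's response against them. As a rough illustration of how such rubric scores might aggregate, here is a minimal Python sketch; the `Criterion` type, the example criteria, and the fraction-of-criteria-satisfied aggregation are illustrative assumptions, not the paper's actual implementation (which uses an automated framework to produce the per-criterion verdicts).

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    description: str  # case-specific requirement, e.g. "adjusts dosing for renal impairment"
    satisfied: bool   # verdict assumed to come from an upstream automated grader

def rubric_score(criteria: list[Criterion]) -> float:
    """Return the fraction of case-specific criteria the model's answer satisfies."""
    if not criteria:
        return 0.0
    return sum(c.satisfied for c in criteria) / len(criteria)

# Hypothetical case graded against three criteria.
case_criteria = [
    Criterion("Identifies the most likely diagnosis", True),
    Criterion("Orders the appropriate first-line test", True),
    Criterion("Tailors treatment to the patient's comorbidities", False),
]
print(f"Rubric score: {rubric_score(case_criteria):.1%}")  # -> 66.7%
```

Under this reading, a benchmark-level figure such as the reported 39.2% would correspond to averaging per-case scores like the one above across all 2,756 cases.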