Measuring the Epistemic Resilience of Large Language Models under Misleading Medical Context

Abstract

Large language models (LLMs) now reach expert-level scores on medical licensing exams, which encourages the assumption that high scores imply safe medical judgment, even as patients increasingly turn to these models for health advice. We show that this assumption is fragile: when misleading context is injected into questions that an LLM originally answers correctly, the model often abandons the correct answer. We call the ability to maintain correct judgment under adversarial context epistemic resilience, and we introduce MedMisBench to measure it. MedMisBench contains 10,932 medical question items and 48,889 misleading context-option pairs spanning medical reasoning, agentic capability, and patient-journey evaluation. Across 11 model configurations, mean accuracy falls from 71.1% on the original questions to 38.0% under focused misleading context, corresponding to an attack success rate of 51.5%. The most damaging injections are formal, rule-like fabrications: authority-framed falsehoods reach an attack success rate of 69.5%, and exception-poisoning claims reach 64.1%. A panel of 14 clinical experts from 7 countries identified serious potential harm in 38.2% of the cases it reviewed. MedMisBench exposes a structural blind spot in the evaluation of LLMs in medical settings: existing benchmarks measure what models know, but not whether they preserve correct medical judgment under misleading context.
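
To make the two headline metrics concrete, the following is a minimal sketch of how accuracy and attack success might be computed from paired model answers. The abstract does not give the exact definitions, so this assumes attack success is the fraction of originally correct answers that become incorrect once misleading context is injected; the benchmark's own definition may differ (for example, it may require a switch to the specific targeted option). The `Trial` record and its field names are hypothetical and are not part of MedMisBench.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """One (question, misleading context) pair. All field names are hypothetical."""
    gold: str             # label of the correct option
    clean_answer: str     # model answer on the original question
    attacked_answer: str  # model answer with misleading context injected


def accuracy(trials, attacked=False):
    """Fraction of trials answered correctly, with or without injected context."""
    if not trials:
        raise ValueError("no trials given")
    hits = sum(
        (t.attacked_answer if attacked else t.clean_answer) == t.gold
        for t in trials
    )
    return hits / len(trials)


def attack_success_rate(trials):
    """Assumed definition: among trials the model originally answered correctly,
    the fraction it answers incorrectly after the misleading context is added."""
    eligible = [t for t in trials if t.clean_answer == t.gold]
    if not eligible:
        raise ValueError("no originally correct trials")
    flipped = sum(t.attacked_answer != t.gold for t in eligible)
    return flipped / len(eligible)


# Toy data for illustration only, not benchmark results.
trials = [
    Trial(gold="A", clean_answer="A", attacked_answer="B"),  # flipped by the attack
    Trial(gold="A", clean_answer="A", attacked_answer="A"),  # resilient
    Trial(gold="C", clean_answer="B", attacked_answer="B"),  # wrong from the start
    Trial(gold="D", clean_answer="D", attacked_answer="C"),  # flipped by the attack
]
clean_acc = accuracy(trials)
attacked_acc = accuracy(trials, attacked=True)
asr = attack_success_rate(trials)
```

Under this assumed definition the attack success rate is conditioned on originally correct answers, so it need not equal the relative drop in mean accuracy; that would be consistent with the reported 51.5% differing from the ratio implied by 71.1% and 38.0%.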