Measuring the Epistemic Resilience of Large Language Models under Misleading Medical Contexts

Abstract

Large language models (LLMs) now reach expert-level scores on medical licensing exams, and patients increasingly turn to them for health advice. Together, these trends encourage the assumption that a high exam score implies safe medical judgment. We show that this assumption is fragile: when misleading context is injected into questions that an LLM originally answers correctly, the model often abandons the correct answer. We call the ability to maintain correct judgment under adversarial context epistemic resilience, and we introduce MedMisBench to measure it. MedMisBench contains 10,932 medical question items and 48,889 misleading context-option pairs spanning medical reasoning, agentic capability, and patient-journey evaluation. Across 11 model configurations, mean accuracy falls from 71.1% on the original questions to 38.0% under focused misleading context, with an attack success rate of 51.5%. The most damaging injections are formal, rule-like fabrications: authority-framed falsehoods reach an attack success rate of 69.5%, and exception-poisoning claims reach 64.1%. A panel of 14 clinical experts from 7 countries identified serious potential harm in 38.2% of the cases it reviewed. MedMisBench exposes a structural blind spot in how LLMs are evaluated for medical settings: existing benchmarks measure what models know, but not whether they preserve correct medical judgment under misleading context.
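
As a rough illustration of the two headline metrics, the sketch below computes accuracy with and without misleading context, and an attack success rate. The abstract does not spell out the formula, so the definition used here (the share of originally correct answers that become incorrect after injection) is an assumption, as are the Trial record layout and the function names; the toy data is invented purely to exercise the code and is not drawn from MedMisBench.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One question paired with one misleading context (hypothetical layout)."""
    correct_original: bool  # model answered the original question correctly
    correct_misled: bool    # model answered correctly after context injection


def accuracy(flags):
    """Fraction of True values; 0.0 for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def attack_success_rate(trials):
    """Assumed definition: among trials answered correctly on the original
    question, the fraction answered incorrectly under misleading context."""
    eligible = [t for t in trials if t.correct_original]
    if not eligible:
        return 0.0
    flipped = sum(1 for t in eligible if not t.correct_misled)
    return flipped / len(eligible)


# Toy data, invented for illustration only.
trials = [
    Trial(True, True),
    Trial(True, False),
    Trial(True, False),
    Trial(False, False),
]

print(accuracy(t.correct_original for t in trials))
print(accuracy(t.correct_misled for t in trials))
print(attack_success_rate(trials))
```

Conditioning on originally correct answers keeps the attack success rate separate from baseline knowledge gaps: a model that never knew the answer cannot be "misled" away from it. Whether the reported figures are pooled over pairs or averaged per model is not stated in the abstract, and this sketch simply pools.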