Measuring the Epistemic Resilience of LLMs Under Misleading Medical Context

Abstract

Large language models (LLMs) now reach expert-level scores on medical licensing exams. This encourages the assumption that high scores imply safe medical judgment, even as patients increasingly turn to LLMs for health advice. We show this assumption is fragile: when misleading context is injected into questions that LLMs originally answer correctly, they abandon the correct answer. We call the ability to maintain correct judgment under adversarial context epistemic resilience, and introduce MedMisBench to measure it. MedMisBench contains 10,932 medical question items and 48,889 misleading context-option pairs spanning medical reasoning, agentic capability, and patient-journey evaluation. Across 11 model configurations, mean accuracy falls from 71.1% on original questions to 38.0% under focused misleading context, with an attack success rate of 51.5%. The most damaging injections are formal, rule-like fabrications: authority-framed falsehoods reach 69.5% attack success and exception-poisoning claims reach 64.1%. A 14-member clinical panel from 7 countries identified serious potential harm in 38.2% of reviewed cases. MedMisBench exposes a structural blind spot in LLM evaluation in medical settings: existing benchmarks measure what models know, but not whether they preserve correct medical judgment under misleading context.
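
The two headline metrics can be illustrated with a minimal sketch. The abstract does not define attack success rate, so the code below assumes one plausible reading: among items the model answers correctly without injected context, the share whose answer becomes incorrect once misleading context is added. The `Trial` record, its field names, and the toy data are hypothetical and are not taken from MedMisBench.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One item answered twice: without and with misleading context (hypothetical record)."""
    gold: str      # correct option label
    original: str  # model answer on the original question
    misled: str    # model answer after misleading context is injected


def accuracy(trials, field):
    """Fraction of trials where the answer stored in `field` matches the gold label."""
    if not trials:
        raise ValueError("no trials")
    return sum(getattr(t, field) == t.gold for t in trials) / len(trials)


def attack_success_rate(trials):
    """Assumed definition: share of originally correct items that turn incorrect."""
    eligible = [t for t in trials if t.original == t.gold]
    if not eligible:
        return 0.0
    flipped = sum(t.misled != t.gold for t in eligible)
    return flipped / len(eligible)


# Toy data, invented for illustration only.
trials = [
    Trial(gold="A", original="A", misled="B"),  # correct, then flipped
    Trial(gold="C", original="C", misled="C"),  # correct, stays correct
    Trial(gold="B", original="B", misled="D"),  # correct, then flipped
    Trial(gold="D", original="A", misled="A"),  # wrong from the start, not eligible
]

print(accuracy(trials, "original"))
print(accuracy(trials, "misled"))
print(attack_success_rate(trials))
```

Under this assumed definition the attack success rate is conditioned on originally correct items, so it need not equal the relative drop in mean accuracy.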