Measuring the Epistemic Resilience of LLMs under Misleading Medical Context

Abstract

Large language models (LLMs) now reach expert-level scores on medical licensing exams, and patients increasingly turn to them for health advice. These scores encourage the assumption that high exam performance implies safe medical judgment. We show this assumption is fragile: when misleading context is injected into questions that LLMs originally answer correctly, they abandon the correct answer. We call the ability to maintain correct judgment under misleading context epistemic resilience, and introduce MedMisBench to measure it. MedMisBench contains 10,932 medical question items and 48,889 misleading context-option pairs spanning medical reasoning, agentic capability, and patient-journey evaluation. Across 11 model configurations, mean accuracy falls from 71.1% on original questions to 38.0% under focused misleading context, with 51.5% attack success. The most damaging injections are formal, rule-like fabrications: authority-framed falsehoods reach 69.5% attack success and exception-poisoning claims reach 64.1%. A 14-member clinical panel from 7 countries identified serious potential harm in 38.2% of reviewed cases. MedMisBench exposes a structural blind spot in the evaluation of LLMs in medical settings: existing benchmarks measure what models know, but not whether they preserve correct medical judgment under misleading context.
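
As an illustration of the two headline quantities, the sketch below computes accuracy and an attack success rate over paired answers. It is a minimal sketch under stated assumptions, not the benchmark's implementation: the abstract does not give the exact metric definitions, so attack success is assumed here to mean the fraction of originally correct answers that become incorrect once misleading context is injected. The `Trial` record, its field names, and the toy data are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """Hypothetical record: one question answered without and with misleading context."""
    gold: str      # correct option label
    original: str  # model answer to the original question
    misled: str    # model answer after misleading context is injected


def accuracy(trials, field):
    """Fraction of trials whose answer in `field` matches the gold label."""
    return sum(getattr(t, field) == t.gold for t in trials) / len(trials)


def attack_success_rate(trials):
    """Assumed definition: among originally correct trials, the fraction that turn wrong."""
    eligible = [t for t in trials if t.original == t.gold]
    if not eligible:
        return 0.0
    flipped = sum(t.misled != t.gold for t in eligible)
    return flipped / len(eligible)


# Toy data for illustration only; not drawn from MedMisBench.
trials = [
    Trial(gold="A", original="A", misled="B"),  # correct, then flipped
    Trial(gold="C", original="C", misled="C"),  # correct, stays correct
    Trial(gold="B", original="B", misled="D"),  # correct, then flipped
    Trial(gold="D", original="A", misled="A"),  # wrong from the start, not eligible
]

print(accuracy(trials, "original"))
print(accuracy(trials, "misled"))
print(attack_success_rate(trials))
```

Restricting the denominator to originally correct items separates resilience from baseline knowledge: a question the model never answered correctly cannot show that misleading context changed its judgment.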