Auditing Multimodal LLM Raters: Central Tendency Bias in Clinical Ordinal Scoring

Abstract

Multimodal large language models (LLMs) are increasingly explored as automated raters in clinical settings, yet their scoring behavior on ordinal clinical scales remains poorly understood. We benchmark three frontier LLM families against supervised deep learning models for scoring Clock Drawing Test (CDT) images on two public datasets using the Shulman rubric. Fully fine-tuned Vision Transformers achieve the best calibration (MAE 0.52, within-1 accuracy 91%), while zero-shot LLMs remain competitive on tolerance-based agreement (GPT-5: MAE 0.67, within-1 accuracy 92%) despite higher absolute error. Per-score analysis, however, reveals that all three LLM families exhibit a pronounced central tendency effect (systematic endpoint compression): predictions are pulled toward the middle of the scale, with over-prediction at the low end (scores of 0 pulled toward 1) and under-prediction at the high end (scores of 5 pulled toward 4). This effect falls disproportionately on the clinically critical extremes, where accurate scoring matters most for cognitive impairment screening decisions. Targeted ablations show that neither few-shot exemplars spanning the full score range nor removing clinical terminology from the prompt eliminates the effect. Our findings extend the LLM-as-a-judge bias literature from NLP evaluation to clinical assessment and highlight the need for calibration-aware evaluation and post-hoc calibration before LLM-based raters are deployed in high-stakes screening workflows.
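
The three quantities referred to above (MAE, within-1 accuracy, and per-score bias) can be illustrated with a minimal Python sketch. The function name `ordinal_metrics` and the toy ratings are hypothetical and are not the study's code or data; per-score bias is taken here to be the mean signed error (prediction minus reference) within each reference score, which is one plausible reading of "per-score analysis".

```python
from collections import defaultdict


def ordinal_metrics(y_true, y_pred):
    """Return (MAE, within-1 accuracy, mean signed error per reference score)."""
    n = len(y_true)
    # Signed error: positive means over-prediction, negative under-prediction.
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    within1 = sum(abs(e) <= 1 for e in errors) / n

    # Group signed errors by the reference score to expose endpoint compression.
    by_score = defaultdict(list)
    for t, e in zip(y_true, errors):
        by_score[t].append(e)
    bias = {s: sum(es) / len(es) for s, es in sorted(by_score.items())}
    return mae, within1, bias


# Hypothetical toy ratings on a 0-5 scale (illustrative only).
y_true = [0, 0, 1, 2, 3, 4, 5, 5]
y_pred = [1, 1, 1, 2, 3, 4, 4, 4]

mae, within1, bias = ordinal_metrics(y_true, y_pred)
print(f"MAE={mae:.2f}  within-1={within1:.0%}")
for score, b in bias.items():
    print(f"reference score {score}: mean signed error {b:+.2f}")
```

In this toy case every prediction lies within one point of the reference, so within-1 accuracy is perfect, yet the signed error is positive at score 0 and negative at score 5. This is the pattern that an aggregate tolerance-based metric can hide and that a per-score breakdown makes visible.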