Auditing Multimodal LLM Raters: Central Tendency Bias in Clinical Ordinal Scoring

Abstract

Multimodal large language models (LLMs) are increasingly explored as automated evaluators in clinical settings, yet their scoring behavior on ordinal clinical scales remains poorly understood. We benchmark three frontier LLM families against supervised deep learning models for scoring Clock Drawing Test (CDT) images on two public datasets using the Shulman rubric. While fully fine-tuned Vision Transformers achieve the best calibration (MAE 0.52, within-1 accuracy 91%), zero-shot LLMs remain competitive on tolerance-based agreement (GPT-5 MAE 0.67, within-1 accuracy 92%) despite higher absolute error. However, per-score analysis reveals that all three LLM families exhibit a pronounced central tendency effect (systematic endpoint compression): predictions are pulled toward the middle of the scale, with over-prediction at the low end (score 0 to 1) and under-prediction at the high end (score 5 to 4). This bias falls hardest on the clinically critical extremes, where accurate scoring most impacts screening decisions for cognitive impairment. Targeted ablations show that neither few-shot exemplars spanning the full score range nor removing clinical terminology from the prompt eliminates the effect. Our findings extend the LLM-as-a-judge bias literature from NLP evaluation to clinical assessment, and highlight the need for calibration-aware evaluation and post-hoc calibration before deploying LLM-based raters in high-stakes screening workflows.
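To make the reported metrics concrete, the following is a minimal Python sketch of how MAE, within-1 accuracy, and a per-score bias profile could be computed for an ordinal 0 to 5 scale. It is an illustration under stated assumptions, not the evaluation code behind the numbers above: the function names are hypothetical, and the score lists are made-up toy data chosen only to show how aggregate metrics can look acceptable while the extremes are compressed.

```python
from collections import defaultdict


def mae(y_true, y_pred):
    """Mean absolute error between reference and predicted scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def within_one_accuracy(y_true, y_pred):
    """Fraction of predictions within one point of the reference score."""
    return sum(abs(t - p) <= 1 for t, p in zip(y_true, y_pred)) / len(y_true)


def per_score_bias(y_true, y_pred):
    """Mean signed error (prediction minus reference) for each reference score.

    A positive value at the low end together with a negative value at the
    high end is the signature of compression toward the middle of the scale.
    """
    errors = defaultdict(list)
    for t, p in zip(y_true, y_pred):
        errors[t].append(p - t)
    return {score: sum(e) / len(e) for score, e in sorted(errors.items())}


# Toy data (illustrative only, not from the study): the rater pushes
# reference scores of 0 up to 1 and some reference scores of 5 down to 4.
y_true = [0, 0, 1, 2, 3, 3, 4, 5, 5, 5]
y_pred = [1, 1, 1, 2, 3, 3, 4, 4, 4, 5]

overall_mae = mae(y_true, y_pred)
overall_within_one = within_one_accuracy(y_true, y_pred)
bias_by_score = per_score_bias(y_true, y_pred)

print(f"MAE: {overall_mae:.2f}")
print(f"Within-1 accuracy: {overall_within_one:.0%}")
for score, bias in bias_by_score.items():
    print(f"reference score {score}: mean signed error {bias:+.2f}")
```

In this toy case every prediction is within one point of the reference, so within-1 accuracy is perfect, yet the per-score profile shows positive bias at score 0 and negative bias at score 5. This is why a tolerance-based metric alone can hide endpoint compression.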