Auditing Multimodal Large Language Model Raters: Central Tendency Bias in Clinical Ordinal Scoring

Abstract

Multimodal large language models (LLMs) are increasingly explored as automated evaluators in clinical settings, yet their scoring behavior on ordinal clinical scales remains poorly understood. We benchmark three frontier LLM families against supervised deep learning models for scoring Clock Drawing Test (CDT) images on two public datasets using the Shulman rubric. Fully fine-tuned Vision Transformers achieve the best calibration (mean absolute error (MAE) 0.52, within-1 accuracy 91%). Zero-shot LLMs have higher absolute error but remain competitive on tolerance-based agreement (GPT-5: MAE 0.67, within-1 accuracy 92%). However, per-score analysis reveals that all three LLM families exhibit a pronounced central tendency effect (systematic endpoint compression): predictions are pulled toward the middle of the scale, with over-prediction at the low end (score 0 to 1) and under-prediction at the high end (score 5 to 4). This effect disproportionately affects the clinically critical extremes, where accurate scoring most impacts screening decisions for cognitive impairment. Targeted ablations show that neither few-shot exemplars spanning the full score range nor removing clinical terminology from the prompt eliminates the effect. Our findings extend the LLM-as-a-judge bias literature from NLP evaluation to clinical assessment, and highlight the need for calibration-aware evaluation and post-hoc calibration before deploying LLM-based raters in high-stakes screening workflows.
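The contrast between aggregate agreement and per-score behavior can be illustrated with a minimal sketch. The function names and the toy ratings below are hypothetical and are not taken from the study's data or code; the sketch only assumes integer scores on a 0 to 5 scale. It computes MAE, within-1 accuracy, and the mean signed error for each true score, which is one simple way to expose endpoint compression.

```python
from collections import defaultdict


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def within_k_accuracy(y_true, y_pred, k=1):
    """Fraction of predictions that fall within k points of the true score."""
    return sum(abs(t - p) <= k for t, p in zip(y_true, y_pred)) / len(y_true)


def per_score_bias(y_true, y_pred):
    """Mean signed error (predicted minus true) grouped by true score.

    Positive values at the low end together with negative values at the
    high end indicate compression toward the middle of the scale.
    """
    errors = defaultdict(list)
    for t, p in zip(y_true, y_pred):
        errors[t].append(p - t)
    return {score: sum(e) / len(e) for score, e in sorted(errors.items())}


# Hypothetical toy ratings on a 0-5 scale (illustration only, not study data).
y_true = [0, 0, 1, 2, 3, 3, 4, 5, 5, 5]
y_pred = [1, 1, 1, 2, 3, 3, 4, 4, 4, 5]

print("MAE:", mean_absolute_error(y_true, y_pred))
print("Within-1 accuracy:", within_k_accuracy(y_true, y_pred))
for score, bias in per_score_bias(y_true, y_pred).items():
    print(f"true score {score}: mean signed error {bias:+.2f}")
```

In this toy example every prediction is within one point of the reference, so within-1 accuracy is perfect, yet the signed error is positive at score 0 and negative at score 5. A tolerance-based metric alone would therefore not reveal the compression.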