Auditing Multimodal LLM Raters: Central Tendency Bias in Clinical Ordinal Scoring

Abstract

Multimodal large language models (LLMs) are increasingly explored as automated evaluators in clinical settings, yet their scoring behavior on ordinal clinical scales remains poorly understood. We benchmark three frontier LLM families against supervised deep learning models on scoring Clock Drawing Test (CDT) images from two public datasets using the Shulman rubric. Fully fine-tuned Vision Transformers achieve the best calibration (MAE 0.52, within-1 accuracy 91%), while zero-shot LLMs remain competitive on tolerance-based agreement despite higher absolute error (GPT-5: MAE 0.67, within-1 accuracy 92%). However, per-score analysis reveals that all three LLM families exhibit a pronounced central tendency effect (systematic endpoint compression): predictions are pulled toward the middle of the scale, with over-prediction at the low end (scores of 0 shifted toward 1) and under-prediction at the high end (scores of 5 shifted toward 4). This effect falls disproportionately on the clinically critical extremes, where accurate scoring matters most for cognitive impairment screening decisions. Targeted ablations show that neither few-shot exemplars spanning the full score range nor removing clinical terminology from the prompt eliminates the effect. Our findings extend the LLM-as-a-judge bias literature from NLP evaluation to clinical assessment, and highlight the need for calibration-aware evaluation and post-hoc calibration before LLM-based raters are deployed in high-stakes screening workflows.
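
The abstract's point that aggregate agreement can mask endpoint compression can be illustrated with a minimal sketch. It assumes the usual definitions of the metrics named above (MAE as mean absolute difference, within-1 accuracy as the share of predictions at most one point from the reference score) and adds a per-score mean signed error as one possible per-score analysis. The function names and the toy scores are hypothetical and are not taken from the study.

```python
from collections import defaultdict


def mae(y_true, y_pred):
    """Mean absolute error between reference and predicted scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def within_one_accuracy(y_true, y_pred):
    """Fraction of predictions at most one scale point from the reference."""
    return sum(abs(t - p) <= 1 for t, p in zip(y_true, y_pred)) / len(y_true)


def per_score_signed_error(y_true, y_pred):
    """Mean of (predicted - reference), grouped by reference score.

    Positive values mean over-prediction, negative values under-prediction.
    """
    buckets = defaultdict(list)
    for t, p in zip(y_true, y_pred):
        buckets[t].append(p - t)
    return {score: sum(errs) / len(errs) for score, errs in sorted(buckets.items())}


# Hypothetical toy data on a 0-5 scale (NOT results from the paper):
# the rater is exact in the middle but compresses both endpoints inward.
y_true = [0, 0, 1, 2, 3, 4, 5, 5]
y_pred = [1, 1, 1, 2, 3, 4, 4, 4]

print("MAE:", mae(y_true, y_pred))
print("within-1 accuracy:", within_one_accuracy(y_true, y_pred))
print("per-score signed error:", per_score_signed_error(y_true, y_pred))
```

On this toy data the within-1 accuracy is perfect, yet the per-score breakdown shows a positive signed error at score 0 and a negative one at score 5, which is the pattern the abstract calls central tendency. A tolerance-based metric alone would not surface it.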