LingxiDiagBench: A Multi-Agent Framework for Evaluating LLMs in Chinese Psychiatric Consultation and Diagnosis

Abstract

Mental disorders are highly prevalent worldwide, but the shortage of psychiatrists and the inherent subjectivity of interview-based diagnosis create substantial barriers to timely and consistent mental-health assessment. Progress in AI-assisted psychiatric diagnosis is constrained by the absence of benchmarks that simultaneously provide realistic patient simulation, clinician-verified diagnostic labels, and support for dynamic multi-turn consultation. We present LingxiDiagBench, a large-scale multi-agent benchmark that evaluates LLMs on both static diagnostic inference and dynamic multi-turn psychiatric consultation in Chinese. At its core is LingxiDiag-16K, a dataset of 16,000 EMR-aligned synthetic consultation dialogues designed to reproduce real clinical demographic and diagnostic distributions across 12 ICD-10 psychiatric categories. Extensive experiments across state-of-the-art LLMs yield three key findings: (1) although LLMs achieve high accuracy on binary depression-anxiety classification (up to 92.3%), accuracy drops substantially for depression-anxiety comorbidity recognition (43.0%) and 12-way differential diagnosis (28.5%); (2) dynamic consultation often underperforms static evaluation, indicating that ineffective information-gathering strategies significantly impair downstream diagnostic reasoning; (3) consultation quality assessed by LLM-as-a-Judge shows only moderate correlation with diagnostic accuracy, suggesting that well-structured questioning alone does not ensure correct diagnostic decisions. To support reproducible research, we release LingxiDiag-16K and the full evaluation framework at https://github.com/Lingxi-mental-health/LingxiDiagBench.