LingxiDiagBench: A Multi-Agent Framework for Benchmarking Large Language Models in Chinese Psychiatric Consultation and Diagnosis

Abstract

Mental disorders are highly prevalent worldwide, but the shortage of psychiatrists and the inherent subjectivity of interview-based diagnosis are substantial barriers to timely and consistent mental-health assessment. Progress in AI-assisted psychiatric diagnosis is constrained by the absence of benchmarks that simultaneously provide realistic patient simulation, clinician-verified diagnostic labels, and support for dynamic multi-turn consultation. We present LingxiDiagBench, a large-scale multi-agent benchmark that evaluates large language models (LLMs) on both static diagnostic inference and dynamic multi-turn psychiatric consultation in Chinese. At its core is LingxiDiag-16K, a dataset of 16,000 synthetic consultation dialogues aligned with electronic medical records (EMRs) and designed to reproduce real clinical demographic and diagnostic distributions across 12 ICD-10 psychiatric categories. Through extensive experiments on state-of-the-art LLMs, we report three key findings: (1) although LLMs achieve high accuracy on binary depression-anxiety classification (up to 92.3%), performance deteriorates substantially on depression-anxiety comorbidity recognition (43.0%) and 12-way differential diagnosis (28.5%); (2) dynamic consultation often underperforms static evaluation, indicating that ineffective information-gathering strategies significantly impair downstream diagnostic reasoning; (3) consultation quality as assessed by LLM-as-a-Judge correlates only moderately with diagnostic accuracy, suggesting that well-structured questioning alone does not ensure correct diagnostic decisions. We release LingxiDiag-16K and the full evaluation framework at https://github.com/Lingxi-mental-health/LingxiDiagBench to support reproducible research.