LingxiDiagBench: A Multi-Agent Framework for Benchmarking LLMs in Chinese Psychiatric Consultation and Diagnosis

Abstract

Mental disorders are highly prevalent worldwide, but the shortage of psychiatrists and the inherent subjectivity of interview-based diagnosis create substantial barriers to timely and consistent mental-health assessment. Progress in AI-assisted psychiatric diagnosis is constrained by the absence of benchmarks that simultaneously provide realistic patient simulation, clinician-verified diagnostic labels, and support for dynamic multi-turn consultation. We present LingxiDiagBench, a large-scale multi-agent benchmark that evaluates LLMs on both static diagnostic inference and dynamic multi-turn psychiatric consultation in Chinese. At its core is LingxiDiag-16K, a dataset of 16,000 EMR-aligned synthetic consultation dialogues designed to reproduce real clinical demographic and diagnostic distributions across 12 ICD-10 psychiatric categories. Extensive experiments across state-of-the-art LLMs yield three key findings: (1) although LLMs achieve high accuracy on binary depression-anxiety classification (up to 92.3%), performance deteriorates substantially for depression-anxiety comorbidity recognition (43.0%) and 12-way differential diagnosis (28.5%); (2) dynamic consultation often underperforms static evaluation, indicating that ineffective information-gathering strategies significantly impair downstream diagnostic reasoning; (3) consultation quality assessed by LLM-as-a-Judge shows only moderate correlation with diagnostic accuracy, suggesting that well-structured questioning alone does not ensure correct diagnostic decisions. To support reproducible research, we release LingxiDiag-16K and the full evaluation framework at https://github.com/Lingxi-mental-health/LingxiDiagBench.