LingxiDiagBench: A Multi-Agent Framework for Benchmarking LLMs in Chinese Psychiatric Consultation and Diagnosis

Abstract

Mental disorders are highly prevalent worldwide, but the shortage of psychiatrists and the inherent subjectivity of interview-based diagnosis create substantial barriers to timely and consistent mental-health assessment. Progress in AI-assisted psychiatric diagnosis is constrained by the absence of benchmarks that simultaneously provide realistic patient simulation, clinician-verified diagnostic labels, and support for dynamic multi-turn consultation. We present LingxiDiagBench, a large-scale multi-agent benchmark that evaluates LLMs on both static diagnostic inference and dynamic multi-turn psychiatric consultation in Chinese. At its core is LingxiDiag-16K, a dataset of 16,000 EMR-aligned synthetic consultation dialogues designed to reproduce real-world clinical demographic and diagnostic distributions across 12 ICD-10 psychiatric categories. Through extensive experiments with state-of-the-art LLMs, we establish three key findings: (1) although LLMs achieve high accuracy on binary depression-anxiety classification (up to 92.3%), performance deteriorates substantially on depression-anxiety comorbidity recognition (43.0%) and 12-way differential diagnosis (28.5%); (2) dynamic consultation often underperforms static evaluation, indicating that ineffective information-gathering strategies significantly impair downstream diagnostic reasoning; (3) consultation quality as assessed by LLM-as-a-Judge shows only moderate correlation with diagnostic accuracy, suggesting that well-structured questioning alone does not ensure correct diagnostic decisions. We release LingxiDiag-16K and the full evaluation framework to support reproducible research at https://github.com/Lingxi-mental-health/LingxiDiagBench.