
SymptomAI: Towards a Conversational AI Agent for Everyday Symptom Assessment

May 5, 2026
作者: Joseph Breda, Fadi Yousif, Beszel Hawkins, Marinela Cotoi, Miao Liu, Ray Luo, Po-Hsuan Cameron Chen, Mike Schaekermann, Samuel Schmidgall, Xin Liu, Girish Narayanswamy, Samuel Solomon, Maxwell A. Xu, Xiaoran Fan, Longfei Shangguan, Anran Wang, Bhavna Daryani, Buddy Herkenham, Cara Tan, Mark Malhotra, Shwetak Patel, John B. Hernandez, Quang Duong, Yun Liu, Zach Wasson, Dimitrios Antos, Bob Lou, Matthew Thompson, Jonathan Richina, Anupam Pathak, Nichole Young-Lin, Jake Sunshine, Daniel McDuff
cs.AI

Abstract

Language models excel at diagnostic assessments on curated medical case studies and vignettes, performing on par with, or better than, clinical professionals. However, existing studies focus on complex scenarios with rich context, making it difficult to draw conclusions about how these systems perform for patients reporting symptoms in everyday life. We deployed SymptomAI, a set of conversational AI agents for end-to-end patient interviewing and differential diagnosis (DDx), via the Fitbit app in a study that randomized participants (N=13,917) to interact with five AI agents. This corpus captures diverse communication styles and a realistic distribution of illnesses from a real-world population. A subset of 1,228 participants reported a clinician-provided diagnosis, and 517 of these were further evaluated by a panel of clinicians over more than 250 hours of annotation. SymptomAI DDx were significantly more accurate (OR = 2.47, p < 0.001) than those from independent clinicians given the same dialogue in a blinded, randomized comparison. Moreover, agentic strategies that conduct a dedicated symptom interview, eliciting additional symptom information before providing a diagnosis, perform substantially better than baseline, user-guided conversations (p < 0.001). An auxiliary analysis of 1,509 conversations from a general US population panel validated that these results generalize beyond wearable-device users. We used SymptomAI diagnoses as labels for all 13,917 participants to analyze over 500,000 days of wearable metrics across nearly 400 unique conditions. We identified strong associations between acute infections and physiological shifts (e.g., OR > 7 for influenza). While limited by self-reported ground truth, these results demonstrate the benefits of a dedicated and complete symptom interview over the user-guided symptom discussion that is the default of most consumer LLMs.
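For readers unfamiliar with the statistic, the odds ratios quoted above (e.g., OR = 2.47 for agent vs. clinician DDx accuracy) come from comparing the odds of a correct diagnosis between two groups. A minimal sketch of that computation, using entirely hypothetical counts (the paper's underlying 2x2 tables are not given in this abstract):

```python
# Hypothetical 2x2 table: counts of correct vs. incorrect diagnoses.
# These numbers are illustrative only, not from the study.
agent_correct, agent_incorrect = 380, 137
clinician_correct, clinician_incorrect = 265, 252

# Odds of a correct diagnosis within each group.
odds_agent = agent_correct / agent_incorrect
odds_clinician = clinician_correct / clinician_incorrect

# The odds ratio compares the two groups; OR > 1 favors the agent.
odds_ratio = odds_agent / odds_clinician
print(f"OR = {odds_ratio:.2f}")
```

In practice, a study like this would also report a confidence interval and p-value for the OR, typically from a logistic regression or an exact test rather than a raw ratio of counts.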