SymptomAI: Towards a Conversational AI Agent for Everyday Symptom Assessment

Abstract

Language models excel at diagnostic assessments on curated medical case studies and vignettes, performing on par with, or better than, clinical professionals. However, existing studies focus on complex scenarios with rich context, making it difficult to draw conclusions about how these systems perform for patients reporting symptoms in everyday life. We deployed SymptomAI, a set of conversational AI agents for end-to-end patient interviewing and differential diagnosis (DDx), via the Fitbit app in a study that randomized participants (N = 13,917) to interact with five AI agents. This corpus captures diverse communication styles and a realistic distribution of illnesses from a real-world population. A subset of 1,228 participants reported a clinician-provided diagnosis, and 517 of these were further evaluated by a panel of clinicians over more than 250 hours of annotation. In a blinded randomized comparison, SymptomAI DDx were significantly more accurate (OR = 2.47, p < 0.001) than those from independent clinicians given the same dialogue. Moreover, agentic strategies that conduct a dedicated symptom interview to elicit additional symptom information before providing a diagnosis perform substantially better than baseline user-guided conversations (p < 0.001). An auxiliary analysis of 1,509 conversations from a general US population panel validated that these results generalize beyond wearable device users. We used SymptomAI diagnoses as labels for all 13,917 participants to analyze over 500,000 days of wearable metrics across nearly 400 unique conditions, and identified strong associations between acute infections and physiological shifts (e.g., OR > 7 for influenza). While limited by self-reported ground truth, these results demonstrate the benefits of a dedicated and complete symptom interview over a user-guided symptom discussion, which is the default in most consumer LLMs.