Rare Disease Differential Diagnosis with Large Language Models at Scale: From Abdominal Actinomycosis to Wilson's Disease

Abstract

Large language models (LLMs) have demonstrated impressive capabilities in disease diagnosis. However, their effectiveness in identifying rarer diseases, which are inherently more challenging to diagnose, remains an open question. Rare disease performance is critical given the increasing use of LLMs in healthcare settings. This is especially true when a primary care physician needs to make a rarer prognosis from only a patient conversation so that they can take the appropriate next step. To that end, several clinical decision support systems are designed to support providers in rare disease identification. Yet their utility is limited by their lack of knowledge of common disorders and their difficulty of use. In this paper, we propose RareScale, which combines the knowledge of LLMs with expert systems. We jointly use an expert system and an LLM to simulate rare disease chats. This data is used to train a rare disease candidate predictor model. Candidates from this smaller model are then used as additional inputs to a black-box LLM to make the final differential diagnosis. Thus, RareScale allows for a balance between rare and common diagnoses. We present results on over 575 rare diseases, beginning with Abdominal Actinomycosis and ending with Wilson's Disease. Our approach significantly improves the baseline performance of black-box LLMs by over 17% in Top-5 accuracy. We also find that our candidate generation performance is high (e.g., 88.8% on gpt-4o-generated chats).
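The abstract mentions two mechanics that a short sketch may make more concrete: scoring a ranked differential with Top-5 accuracy, and passing rare disease candidates to a black-box LLM as additional input. The Python below is a minimal illustration under stated assumptions, not the authors' implementation. The function names `top_k_accuracy` and `build_ddx_prompt`, the case-insensitive exact-match rule, and the prompt wording are all hypothetical. The abstract does not say how predictions are matched to gold labels or how the prompt is phrased.

```python
from typing import Sequence


def top_k_accuracy(ranked_predictions: Sequence[Sequence[str]],
                   gold_labels: Sequence[str], k: int = 5) -> float:
    """Fraction of cases whose gold diagnosis appears in the first k predictions.

    Assumption: a hit is a case-insensitive exact string match.
    """
    if len(ranked_predictions) != len(gold_labels):
        raise ValueError("predictions and labels must have the same length")
    if not gold_labels:
        return 0.0
    hits = sum(
        gold.lower() in [p.lower() for p in preds[:k]]
        for preds, gold in zip(ranked_predictions, gold_labels)
    )
    return hits / len(gold_labels)


def build_ddx_prompt(chat: str, rare_candidates: Sequence[str]) -> str:
    """Append rare disease candidates to the conversation as extra context.

    Assumption: the prompt wording here is invented for illustration.
    """
    candidate_lines = "\n".join(f"- {name}" for name in rare_candidates)
    return (
        "Patient conversation:\n"
        f"{chat}\n\n"
        "Rare diseases to consider alongside common ones:\n"
        f"{candidate_lines}\n\n"
        "List the five most likely diagnoses, most likely first."
    )
```

In this sketch the candidates are offered as suggestions, not constraints, which is one plausible way to leave the black-box LLM free to rank common diagnoses above the rare ones.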
