Rare Disease Differential Diagnosis with Large Language Models at Scale: From Abdominal Actinomycosis to Wilson's Disease

Abstract

Large language models (LLMs) have demonstrated impressive capabilities in disease diagnosis. However, their effectiveness in identifying rarer diseases, which are inherently more challenging to diagnose, remains an open question. Rare disease performance becomes critical as LLMs are increasingly used in healthcare settings. This is especially true when a primary care physician needs to make a rarer diagnosis from only a patient conversation so that they can take the appropriate next step. To that end, several clinical decision support systems have been designed to support providers in rare disease identification. Yet their utility is limited by their lack of knowledge of common disorders and their difficulty of use. In this paper, we propose RareScale, which combines the knowledge of LLMs with expert systems. We jointly use an expert system and an LLM to simulate rare disease chats. The simulated chats are used to train a rare disease candidate predictor model. Candidates from this smaller model are then provided as additional inputs to a black-box LLM, which makes the final differential diagnosis. Thus, RareScale allows for a balance between rare and common diagnoses. We present results on over 575 rare diseases, beginning with Abdominal Actinomycosis and ending with Wilson's Disease. Our approach significantly improves the baseline performance of black-box LLMs by over 17% in Top-5 accuracy. We also find that our candidate generation performance is high (e.g., 88.8% on gpt-4o generated chats).
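
The abstract describes two mechanical steps that a short example can make concrete: passing rare disease candidates to a black-box LLM as additional input, and scoring the resulting differential by Top-5 accuracy. The following is a minimal Python sketch under stated assumptions, not the authors' implementation. The function names `build_prompt` and `top_k_accuracy`, the prompt wording, and the exact-string matching of diagnoses are all hypothetical choices made for illustration; the abstract does not specify any of them.

```python
from typing import Sequence


def build_prompt(chat: str, candidates: Sequence[str]) -> str:
    """Attach rare disease candidates to a patient chat as extra LLM input.

    Assumption: the prompt wording below is invented for illustration.
    """
    hints = "\n".join(f"- {name}" for name in candidates)
    return (
        "Patient conversation:\n"
        f"{chat}\n\n"
        "Rare diseases to consider (from a smaller candidate model):\n"
        f"{hints}\n\n"
        "List the five most likely diagnoses, rare or common, most likely first."
    )


def top_k_accuracy(
    ranked: Sequence[Sequence[str]], gold: Sequence[str], k: int = 5
) -> float:
    """Fraction of cases whose gold diagnosis is among the first k predictions.

    Assumption: diagnoses are compared by exact string match.
    """
    if len(ranked) != len(gold):
        raise ValueError("ranked and gold must have the same length")
    if not gold:
        return 0.0
    hits = sum(1 for preds, truth in zip(ranked, gold) if truth in preds[:k])
    return hits / len(gold)
```

In this sketch the candidates are only hints: the final ranking is left to the black-box LLM, which is free to place common diagnoses above them. That mirrors the balance between rare and common diagnoses that the abstract attributes to RareScale.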

