Rare Disease Differential Diagnosis with Large Language Models at Scale: From Abdominal Actinomycosis to Wilson's Disease

February 20, 2025
Authors: Elliot Schumacher, Dhruv Naik, Anitha Kannan
cs.AI

Abstract

Large language models (LLMs) have demonstrated impressive capabilities in disease diagnosis. However, their effectiveness in identifying rarer diseases, which are inherently more challenging to diagnose, remains an open question. Rare disease performance is critical given the increasing use of LLMs in healthcare settings. This is especially true when a primary care physician needs to make a rarer prognosis from only a patient conversation so that they can take the appropriate next step. To that end, several clinical decision support systems have been designed to support providers in rare disease identification. Yet their utility is limited by their lack of knowledge of common disorders and their difficulty of use. In this paper, we propose RareScale to combine the knowledge of LLMs with expert systems. We jointly use an expert system and an LLM to simulate rare disease chats. This data is used to train a rare disease candidate predictor model. Candidates from this smaller model are then used as additional inputs to a black-box LLM to make the final differential diagnosis. Thus, RareScale allows for a balance between rare and common diagnoses. We present results on over 575 rare diseases, beginning with Abdominal Actinomycosis and ending with Wilson's Disease. Our approach significantly improves the baseline performance of black-box LLMs by over 17% in Top-5 accuracy. We also find that our candidate generation performance is high (e.g., 88.8% on gpt-4o generated chats).
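
The two-stage idea in the abstract (candidates from a smaller model passed as extra input to a black-box LLM, scored by Top-5 accuracy) can be illustrated with a minimal Python sketch. Everything here is an assumption for illustration: the function names `build_ddx_prompt` and `top_k_accuracy`, the prompt wording, and the case-insensitive string matching are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def build_ddx_prompt(chat: str, rare_candidates: Sequence[str], k: int = 5) -> str:
    """Combine a patient chat with rare-disease candidates into one prompt.

    The prompt layout is a hypothetical format, not the one used by RareScale.
    """
    candidate_lines = "\n".join(f"- {name}" for name in rare_candidates)
    return (
        f"Patient conversation:\n{chat}\n\n"
        f"Rare diseases to consider (from a smaller candidate model):\n"
        f"{candidate_lines}\n\n"
        f"List the {k} most likely diagnoses, weighing common and rare conditions."
    )


def top_k_accuracy(
    predictions: Sequence[Sequence[str]], gold: Sequence[str], k: int = 5
) -> float:
    """Fraction of cases whose gold diagnosis appears in the first k predictions.

    Matching is by case-insensitive exact string, an assumption made here
    to keep the sketch simple.
    """
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        return 0.0
    hits = 0
    for ranked, truth in zip(predictions, gold):
        top_k = {diagnosis.lower() for diagnosis in ranked[:k]}
        if truth.lower() in top_k:
            hits += 1
    return hits / len(gold)


if __name__ == "__main__":
    prompt = build_ddx_prompt(
        "Patient reports tremor and fatigue.",
        ["Wilson's Disease", "Abdominal Actinomycosis"],
    )
    print(prompt)
    # Toy rankings: the first case is a hit, the second is a miss.
    ranked = [["Hepatitis", "Wilson's Disease"], ["Appendicitis"]]
    print(top_k_accuracy(ranked, ["wilson's disease", "Abdominal Actinomycosis"]))
```

In this sketch the candidate list is only appended as context, so the downstream model remains free to rank common diagnoses ahead of the rare ones, which mirrors the balance between rare and common diagnoses that the abstract describes.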
