Question Answering on Patient Medical Records with Private Fine-Tuned LLMs

Abstract

Healthcare systems continuously generate vast amounts of electronic health records (EHRs), commonly stored in the Fast Healthcare Interoperability Resources (FHIR) standard. Despite the wealth of information in these records, their complexity and volume make it difficult for users to retrieve and interpret crucial health insights. Recent advances in Large Language Models (LLMs) offer a solution by enabling semantic question answering (QA) over medical data, so that users can interact with their health records more effectively. However, ensuring privacy and compliance requires edge and private deployments of LLMs. This paper proposes a novel approach to semantic QA over EHRs that first identifies the FHIR resources most relevant to a user query (Task 1) and then answers the query based on these resources (Task 2). We explore the performance of privately hosted, fine-tuned LLMs, evaluating them against benchmark models such as GPT-4 and GPT-4o. Our results demonstrate that fine-tuned LLMs, while 250x smaller in size, outperform GPT-4 family models by 0.55% in F1 score on Task 1 and by 42% in METEOR score on Task 2. Additionally, we examine advanced aspects of LLM usage, including sequential fine-tuning, model self-evaluation (narcissistic evaluation), and the impact of training data size on performance. The models and datasets are available here: https://huggingface.co/genloop
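As an illustration of how an F1 score for the resource-identification task could be computed, the following minimal sketch compares a set of predicted resource identifiers with a gold set. This is an assumption made for illustration only: the abstract does not say how F1 is computed in the paper, and the function name `f1_score` and the resource identifiers are hypothetical.

```python
def f1_score(predicted: set[str], relevant: set[str]) -> float:
    """Set-based F1 between predicted and gold resource IDs (illustrative only)."""
    if not predicted or not relevant:
        return 0.0
    true_positives = len(predicted & relevant)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)  # share of predictions that are correct
    recall = true_positives / len(relevant)      # share of gold resources that were found
    return 2 * precision * recall / (precision + recall)


# Hypothetical FHIR resource IDs, not taken from the paper's dataset.
gold = {"Condition/1", "MedicationRequest/7"}
pred = {"Condition/1", "Observation/3"}
score = f1_score(pred, gold)
```

In this toy case one of the two predicted resources is correct and one of the two gold resources is found, so precision and recall are both one half.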
