Medical large language models are easily distracted
April 1, 2025
Authors: Krithik Vishwanath, Anton Alyakin, Daniel Alexander Alber, Jin Vivian Lee, Douglas Kondziolka, Eric Karl Oermann
cs.AI
Abstract
Large language models (LLMs) have the potential to transform medicine, but
real-world clinical scenarios contain extraneous information that can hinder
performance. The rise of assistive technologies like ambient dictation, which
automatically generates draft notes from live patient encounters, has the
potential to introduce additional noise, making it crucial to assess the ability
of LLMs to filter relevant data. To investigate this, we developed
MedDistractQA, a benchmark using USMLE-style questions embedded with simulated
real-world distractions. Our findings show that distracting statements
(polysemous words with clinical meanings used in a non-clinical context or
references to unrelated health conditions) can reduce LLM accuracy by up to
17.9%. Commonly proposed solutions to improve model performance, such as
retrieval-augmented generation (RAG) and medical fine-tuning, did not change
this effect, and in some cases introduced their own confounders and further
degraded performance. Our findings suggest that LLMs natively lack the logical
mechanisms necessary to distinguish relevant from irrelevant clinical
information, posing challenges for real-world applications. MedDistractQA and
our results highlight the need for robust mitigation strategies to enhance LLM
resilience to extraneous information.
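
Below is a minimal sketch of how a MedDistractQA-style evaluation could be set up: a distracting sentence is appended to a USMLE-style question stem, the same model answers both the clean and the distracted versions, and accuracy is compared. The toy item, the example distractor, and the answer_question() stub are illustrative assumptions only, not the authors' actual data or pipeline.

```python
# Sketch of a distraction-robustness check (assumptions: the question format,
# the example distractor, and answer_question() are placeholders).

def build_prompt(question, options, distractor=None):
    """Format a USMLE-style multiple-choice question, optionally embedding a distracting sentence."""
    stem = f"{question} {distractor}" if distractor else question
    choices = "\n".join(f"{letter}. {text}" for letter, text in sorted(options.items()))
    return f"{stem}\n{choices}\nAnswer with a single letter."


def answer_question(prompt):
    """Placeholder for an LLM call; swap in a real model or API here."""
    return "A"  # stub so the sketch runs end to end


def accuracy(items, distract):
    """Fraction of items answered correctly, with or without distractors embedded."""
    correct = 0
    for item in items:
        prompt = build_prompt(item["question"], item["options"],
                              item["distractor"] if distract else None)
        if answer_question(prompt).strip().upper().startswith(item["answer"]):
            correct += 1
    return correct / len(items)


if __name__ == "__main__":
    # Single toy item; the distractor reuses a clinically meaningful word ("pressure")
    # in a non-clinical sense, one of the two distraction types described above.
    items = [{
        "question": "A 58-year-old man presents with crushing chest pain radiating "
                    "to the left arm. What is the most appropriate next step?",
        "distractor": "His daughter mentioned she has been under a lot of pressure at work lately.",
        "options": {"A": "Obtain an ECG", "B": "Discharge home",
                    "C": "Order a skull X-ray", "D": "Start antibiotics"},
        "answer": "A",
    }]
    print("baseline accuracy:  ", accuracy(items, distract=False))
    print("distracted accuracy:", accuracy(items, distract=True))
```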