Medical large language models are easily distracted

April 1, 2025
Authors: Krithik Vishwanath, Anton Alyakin, Daniel Alexander Alber, Jin Vivian Lee, Douglas Kondziolka, Eric Karl Oermann
cs.AI

Abstract

Large language models (LLMs) have the potential to transform medicine, but real-world clinical scenarios contain extraneous information that can hinder performance. The rise of assistive technologies such as ambient dictation, which automatically generates draft notes from live patient encounters, may introduce additional noise, making it crucial to assess LLMs' ability to filter relevant data. To investigate this, we developed MedDistractQA, a benchmark of USMLE-style questions embedded with simulated real-world distractions. Our findings show that distracting statements (polysemous words with clinical meanings used in a non-clinical context, or references to unrelated health conditions) can reduce LLM accuracy by up to 17.9%. Commonly proposed solutions for improving model performance, such as retrieval-augmented generation (RAG) and medical fine-tuning, did not change this effect and in some cases introduced their own confounders, degrading performance further. Our findings suggest that LLMs natively lack the logical mechanisms necessary to distinguish relevant from irrelevant clinical information, posing challenges for real-world applications. MedDistractQA and our results highlight the need for robust mitigation strategies to enhance LLM resilience to extraneous information.
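
The core manipulation described in the abstract is straightforward to reproduce: append a clinically irrelevant distractor sentence to each question stem and compare model accuracy with and without it. Below is a minimal sketch of such an evaluation loop. The item schema, prompt wording, and the `query_model` hook are illustrative assumptions, not the benchmark's actual format; consult the MedDistractQA release for the real data layout.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class MCQItem:
    question: str            # USMLE-style question stem
    choices: dict[str, str]  # option letter -> option text
    answer: str              # correct option letter, e.g. "C"
    distractor: str          # simulated real-world distraction (hypothetical field)

def build_prompt(item: MCQItem, distracted: bool) -> str:
    stem = item.question
    if distracted:
        # Embed the distracting statement in the stem, mirroring the
        # paper's manipulation (e.g. a non-clinical use of a polysemous
        # word, or a mention of an unrelated health condition).
        stem = f"{stem} {item.distractor}"
    options = "\n".join(f"{k}. {v}" for k, v in item.choices.items())
    return f"{stem}\n{options}\nAnswer with a single letter."

def accuracy(items: list[MCQItem],
             query_model: Callable[[str], str],
             distracted: bool) -> float:
    # `query_model` is a user-supplied hook that sends a prompt to
    # whatever LLM is under test and returns its text reply.
    correct = sum(
        query_model(build_prompt(it, distracted)).strip().upper().startswith(it.answer)
        for it in items
    )
    return correct / len(items)
```

Comparing `accuracy(items, query_model, distracted=False)` against `accuracy(items, query_model, distracted=True)` yields the per-model accuracy drop; the paper reports drops of up to 17.9% on its benchmark.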