

Language Models are Surprisingly Fragile to Drug Names in Biomedical Benchmarks

June 17, 2024
作者: Jack Gallifant, Shan Chen, Pedro Moreira, Nikolaj Munch, Mingye Gao, Jackson Pond, Leo Anthony Celi, Hugo Aerts, Thomas Hartvigsen, Danielle Bitterman
cs.AI

Abstract

Medical knowledge is context-dependent and requires consistent reasoning across various natural language expressions of semantically equivalent phrases. This is particularly crucial for drug names, where patients often use brand names like Advil or Tylenol instead of their generic equivalents. To study this, we create a new robustness dataset, RABBITS, to evaluate performance differences on medical benchmarks after swapping brand and generic drug names using physician expert annotations. We assess both open-source and API-based LLMs on MedQA and MedMCQA, revealing a consistent performance drop ranging from 1-10%. Furthermore, we identify a potential source of this fragility as the contamination of test data in widely used pre-training datasets. All code is accessible at https://github.com/BittermanLab/RABBITS, and a HuggingFace leaderboard is available at https://huggingface.co/spaces/AIM-Harvard/rabbits-leaderboard.
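The brand/generic swap described above can be sketched as a simple text substitution followed by a before/after accuracy comparison. This is a minimal illustration, not the RABBITS implementation: the dataset uses physician-annotated name pairs, whereas the small mapping and the `model_answer` callable below are hypothetical stand-ins.

```python
import re

# Illustrative brand -> generic pairs (RABBITS uses physician expert
# annotations; this tiny dict is only for demonstration).
BRAND_TO_GENERIC = {
    "Advil": "ibuprofen",
    "Tylenol": "acetaminophen",
    "Zoloft": "sertraline",
}

def swap_drug_names(text: str, mapping: dict = BRAND_TO_GENERIC) -> str:
    """Replace each brand name with its generic equivalent (whole words only)."""
    for brand, generic in mapping.items():
        text = re.sub(rf"\b{re.escape(brand)}\b", generic, text)
    return text

def accuracy_drop(model_answer, questions, gold_answers) -> float:
    """Accuracy on original questions minus accuracy after name swapping.

    `model_answer` is any callable mapping a question string to an answer
    string (e.g. a wrapper around an LLM API).
    """
    n = len(questions)
    orig = sum(model_answer(q) == g for q, g in zip(questions, gold_answers))
    swapped = sum(
        model_answer(swap_drug_names(q)) == g
        for q, g in zip(questions, gold_answers)
    )
    return orig / n - swapped / n
```

A robust model should answer identically in both conditions, so `accuracy_drop` near zero indicates invariance to surface-form drug names; the paper reports drops of 1-10% on MedQA and MedMCQA.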

