Language Models are Surprisingly Fragile to Drug Names in Biomedical Benchmarks
June 17, 2024
Authors: Jack Gallifant, Shan Chen, Pedro Moreira, Nikolaj Munch, Mingye Gao, Jackson Pond, Leo Anthony Celi, Hugo Aerts, Thomas Hartvigsen, Danielle Bitterman
cs.AI
Abstract
Medical knowledge is context-dependent and requires consistent reasoning
across various natural language expressions of semantically equivalent phrases.
This is particularly crucial for drug names, where patients often use brand
names like Advil or Tylenol instead of their generic equivalents. To study
this, we create a new robustness dataset, RABBITS, to evaluate performance
differences on medical benchmarks after swapping brand and generic drug names
using physician expert annotations.
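As a rough illustration of the swap at the core of RABBITS, the minimal Python sketch below replaces whole-word brand names with generic equivalents. The three-entry mapping is a hypothetical sample for illustration only, not the paper's physician-annotated vocabulary (see the RABBITS repository for the real data).

```python
import re

# Illustrative brand -> generic mapping (a tiny hypothetical sample;
# RABBITS uses physician-annotated brand/generic pairs).
BRAND_TO_GENERIC = {
    "Advil": "ibuprofen",
    "Tylenol": "acetaminophen",
    "Lipitor": "atorvastatin",
}

def swap_drug_names(text, mapping):
    """Replace whole-word brand names with their generic equivalents."""
    for brand, generic in mapping.items():
        text = re.sub(rf"\b{re.escape(brand)}\b", generic, text,
                      flags=re.IGNORECASE)
    return text

question = "A patient taking Tylenol and Advil presents with epigastric pain."
print(swap_drug_names(question, BRAND_TO_GENERIC))
# -> "A patient taking acetaminophen and ibuprofen presents with epigastric pain."
```

Scoring a model on both the original and the swapped version of each question, then comparing accuracy across the benchmark, yields the performance difference the paper measures.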
We assess both open-source and API-based LLMs on MedQA and MedMCQA, revealing
a consistent performance drop ranging from 1-10%. Furthermore, we identify a
potential source of this fragility as the contamination of test data in widely
used pre-training datasets. All code is accessible at
https://github.com/BittermanLab/RABBITS, and a HuggingFace leaderboard is
available at https://huggingface.co/spaces/AIM-Harvard/rabbits-leaderboard.
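As a loose sketch of how such contamination can be probed: one common heuristic (assumed here; the paper's exact procedure may differ) flags a benchmark question if any long word n-gram from it appears verbatim in a pre-training document.

```python
def word_ngrams(text, n=13):
    """All lowercase word n-grams in the text, as a set of tuples."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def shares_ngram(question, pretraining_doc, n=13):
    """True if any n-gram of the question appears verbatim in the document."""
    return bool(word_ngrams(question, n) & word_ngrams(pretraining_doc, n))
```

Flagging every question that shares such an n-gram with a pre-training corpus gives a crude upper bound on verbatim leakage; a full audit would also normalize punctuation and scan at corpus scale.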