Language Models are Surprisingly Fragile to Drug Names in Biomedical Benchmarks
June 17, 2024
Authors: Jack Gallifant, Shan Chen, Pedro Moreira, Nikolaj Munch, Mingye Gao, Jackson Pond, Leo Anthony Celi, Hugo Aerts, Thomas Hartvigsen, Danielle Bitterman
cs.AI
Abstract
Medical knowledge is context-dependent and requires consistent reasoning
across various natural language expressions of semantically equivalent phrases.
This is particularly crucial for drug names, where patients often use brand
names like Advil or Tylenol instead of their generic equivalents. To study
this, we create a new robustness dataset, RABBITS, to evaluate performance
differences on medical benchmarks after swapping brand and generic drug names
using physician expert annotations.
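As a rough illustration of the swap at the core of RABBITS, the minimal Python sketch below replaces whole-word brand names with generic equivalents. The three-entry mapping is a hypothetical sample for illustration only, not the paper's physician-annotated vocabulary (see the RABBITS repository for the real data).

```python
import re

# Illustrative brand -> generic mapping (a tiny hypothetical sample;
# RABBITS uses physician-annotated brand/generic pairs).
BRAND_TO_GENERIC = {
    "Advil": "ibuprofen",
    "Tylenol": "acetaminophen",
    "Lipitor": "atorvastatin",
}

def swap_drug_names(text, mapping):
    """Replace whole-word brand names with their generic equivalents."""
    for brand, generic in mapping.items():
        text = re.sub(rf"\b{re.escape(brand)}\b", generic, text,
                      flags=re.IGNORECASE)
    return text

question = "A patient taking Tylenol and Advil presents with epigastric pain."
print(swap_drug_names(question, BRAND_TO_GENERIC))
# -> "A patient taking acetaminophen and ibuprofen presents with epigastric pain."
```

Scoring a model on both the original and the swapped version of each question, then comparing accuracy across the benchmark, yields the performance difference the paper measures.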
We assess both open-source and API-based LLMs on MedQA and MedMCQA, revealing
a consistent performance drop ranging from 1-10%. Furthermore, we identify a
potential source of this fragility as the contamination of test data in widely
used pre-training datasets. All code is accessible at
https://github.com/BittermanLab/RABBITS, and a HuggingFace leaderboard is
available at https://huggingface.co/spaces/AIM-Harvard/rabbits-leaderboard.
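As a loose sketch of how such contamination can be probed: one common heuristic (assumed here; the paper's exact procedure may differ) flags a benchmark question if any long word n-gram from it appears verbatim in a pre-training document.

```python
def word_ngrams(text, n=13):
    """All lowercase word n-grams in the text, as a set of tuples."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def shares_ngram(question, pretraining_doc, n=13):
    """True if any n-gram of the question appears verbatim in the document."""
    return bool(word_ngrams(question, n) & word_ngrams(pretraining_doc, n))
```

Flagging every question that shares such an n-gram with a pre-training corpus gives a crude upper bound on verbatim leakage; a full audit would also normalize punctuation and scan at corpus scale.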