Language Models are Surprisingly Fragile to Drug Names in Biomedical Benchmarks

Abstract

Medical knowledge is context-dependent and requires consistent reasoning across various natural language expressions of semantically equivalent phrases. This is particularly crucial for drug names, where patients often use brand names like Advil or Tylenol instead of their generic equivalents. To study this, we create a new robustness dataset, RABBITS, to evaluate performance differences on medical benchmarks after swapping brand and generic drug names using physician expert annotations. We assess both open-source and API-based LLMs on MedQA and MedMCQA, revealing a consistent performance drop ranging from 1% to 10%. Furthermore, we identify a potential source of this fragility as the contamination of test data in widely used pre-training datasets. All code is accessible at https://github.com/BittermanLab/RABBITS, and a HuggingFace leaderboard is available at https://huggingface.co/spaces/AIM-Harvard/rabbits-leaderboard.
