Language Models are Surprisingly Fragile to Drug Names in Biomedical Benchmarks

Abstract

Medical knowledge is context-dependent and requires consistent reasoning across the many natural-language expressions of semantically equivalent phrases. This is particularly crucial for drug names, where patients often use brand names such as Advil or Tylenol instead of their generic equivalents. To study this, we create a new robustness dataset, RABBITS, to evaluate performance differences on medical benchmarks after swapping brand and generic drug names using physician expert annotations. We assess both open-source and API-based large language models (LLMs) on MedQA and MedMCQA, revealing a consistent performance drop ranging from 1% to 10%. Furthermore, we identify a potential source of this fragility: contamination of test data in widely used pre-training datasets. All code is accessible at https://github.com/BittermanLab/RABBITS, and a HuggingFace leaderboard is available at https://huggingface.co/spaces/AIM-Harvard/rabbits-leaderboard.
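The brand-to-generic swap described in the abstract can be illustrated with a minimal sketch. This is not the RABBITS pipeline: the two-entry mapping, the function names `swap_drug_names` and `accuracy_drop`, and the regex-based matching are all assumptions made for illustration, whereas the actual dataset relies on physician expert annotations.

```python
import re

# Hypothetical brand-to-generic pairs for illustration only; the RABBITS
# mapping itself was curated with physician expert annotations.
BRAND_TO_GENERIC = {
    "Advil": "ibuprofen",
    "Tylenol": "acetaminophen",
}


def swap_drug_names(text, mapping):
    """Replace each whole-word key in `mapping` with its value, ignoring case."""
    if not mapping:
        return text
    # Try longer names first so a multi-word name wins over its own prefix.
    names = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in names) + r")\b",
        re.IGNORECASE,
    )
    lookup = {name.lower(): replacement for name, replacement in mapping.items()}
    return pattern.sub(lambda match: lookup[match.group(0).lower()], text)


def accuracy_drop(original_correct, swapped_correct):
    """Accuracy difference, in percentage points, between the two runs."""
    if len(original_correct) != len(swapped_correct) or not original_correct:
        raise ValueError("expected two non-empty lists of equal length")
    total = len(original_correct)
    return 100.0 * (sum(original_correct) - sum(swapped_correct)) / total


question = "A patient took Advil and Tylenol for a headache."
print(swap_drug_names(question, BRAND_TO_GENERIC))
# prints: A patient took ibuprofen and acetaminophen for a headache.
```

Matching on word boundaries keeps the swap from touching longer tokens that merely contain a drug name. A plain string substitution like this would still miss context-dependent cases, which is presumably where expert annotation matters.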

