MedBLINK: Probing Basic Perception in Multimodal Language Models for Medicine

Abstract

Multimodal language models (MLMs) show promise for clinical decision support and diagnostic reasoning, raising the prospect of end-to-end automated medical image interpretation. However, clinicians are highly selective in adopting AI tools; a model that makes errors on seemingly simple perception tasks, such as determining image orientation or identifying whether a CT scan is contrast-enhanced, is unlikely to be adopted for clinical tasks. We introduce MedBLINK, a benchmark designed to probe these models for such perceptual abilities. MedBLINK spans eight clinically meaningful tasks across multiple imaging modalities and anatomical regions, totaling 1,429 multiple-choice questions over 1,605 images. We evaluate 19 state-of-the-art MLMs, including general-purpose (GPT-4o, Claude 3.5 Sonnet) and domain-specific (Med-Flamingo, LLaVA-Med, RadFM) models. While human annotators achieve 96.4% accuracy, the best-performing model reaches only 65%. These results show that current MLMs frequently fail at routine perceptual checks, suggesting the need to strengthen their visual grounding to support clinical adoption. Data is available on our project page.
