

MedBLINK: Probing Basic Perception in Multimodal Language Models for Medicine

August 4, 2025
Authors: Mahtab Bigverdi, Wisdom Ikezogwo, Kevin Zhang, Hyewon Jeong, Mingyu Lu, Sungjae Cho, Linda Shapiro, Ranjay Krishna
cs.AI

Abstract

Multimodal language models (MLMs) show promise for clinical decision support and diagnostic reasoning, raising the prospect of end-to-end automated medical image interpretation. However, clinicians are highly selective in adopting AI tools; a model that makes errors on seemingly simple perception tasks, such as determining image orientation or identifying whether a CT scan is contrast-enhanced, is unlikely to be adopted for clinical tasks. We introduce MedBLINK, a benchmark designed to probe these models for such perceptual abilities. MedBLINK spans eight clinically meaningful tasks across multiple imaging modalities and anatomical regions, totaling 1,429 multiple-choice questions over 1,605 images. We evaluate 19 state-of-the-art MLMs, including general-purpose (GPT4o, Claude 3.5 Sonnet) and domain-specific (Med Flamingo, LLaVA Med, RadFM) models. While human annotators achieve 96.4% accuracy, the best-performing model reaches only 65%. These results show that current MLMs frequently fail at routine perceptual checks, suggesting the need to strengthen their visual grounding to support clinical adoption. Data is available on our project page.
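For readers who want a sense of how accuracy on a multiple-choice benchmark like this is typically computed, below is a minimal sketch of an evaluation loop. The item fields (`image`, `question`, `options`, `answer`) and the `model_predict` callback are hypothetical placeholders for illustration, not MedBLINK's actual release format or evaluation harness.

```python
import json
from typing import Callable, List


def evaluate_mcq(
    items_path: str,
    model_predict: Callable[[str, str, List[str]], str],
) -> float:
    """Compute per-question accuracy over multiple-choice items.

    Each item is assumed to contain an image path, a question string,
    a list of answer options, and the correct option string.
    """
    with open(items_path) as f:
        items = json.load(f)

    correct = 0
    for item in items:
        # The model returns its chosen option given the image, question, and options.
        prediction = model_predict(item["image"], item["question"], item["options"])
        if prediction == item["answer"]:
            correct += 1

    return correct / len(items)
```

The accuracy figures quoted in the abstract (96.4% for human annotators versus 65% for the best model) are presumably this kind of per-question accuracy aggregated over the benchmark's eight tasks.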