MedBLINK: Probing Basic Perception in Multimodal Language Models for Medicine

August 4, 2025
Authors: Mahtab Bigverdi, Wisdom Ikezogwo, Kevin Zhang, Hyewon Jeong, Mingyu Lu, Sungjae Cho, Linda Shapiro, Ranjay Krishna
cs.AI

Abstract

Multimodal language models (MLMs) show promise for clinical decision support and diagnostic reasoning, raising the prospect of end-to-end automated medical image interpretation. However, clinicians are highly selective in adopting AI tools; a model that makes errors on seemingly simple perception tasks, such as determining image orientation or identifying whether a CT scan is contrast-enhanced, is unlikely to be adopted for clinical tasks. We introduce MedBLINK, a benchmark designed to probe these models for such perceptual abilities. MedBLINK spans eight clinically meaningful tasks across multiple imaging modalities and anatomical regions, totaling 1,429 multiple-choice questions over 1,605 images. We evaluate 19 state-of-the-art MLMs, including general-purpose (GPT4o, Claude 3.5 Sonnet) and domain-specific (Med Flamingo, LLaVA Med, RadFM) models. While human annotators achieve 96.4% accuracy, the best-performing model reaches only 65%. These results show that current MLMs frequently fail at routine perceptual checks, suggesting the need to strengthen their visual grounding to support clinical adoption. Data is available on our project page.
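To make the evaluation protocol concrete, below is a minimal sketch of how multiple-choice accuracy over MedBLINK-style items could be computed. The JSON file name, field names, and the `answer_question` callable wrapping a model are illustrative assumptions, not the paper's released tooling; the actual data format is documented on the project page.

```python
# Minimal sketch of scoring a model on multiple-choice perception items.
# The file name, JSON fields, and `answer_question` callable are assumptions
# for illustration only.
import json


def evaluate(items, answer_question):
    """Return multiple-choice accuracy over a list of benchmark items.

    Each item is assumed to hold an image path, a question, a list of
    answer options, and the index of the correct option.
    """
    correct = 0
    for item in items:
        pred = answer_question(item["image_path"], item["question"], item["options"])
        correct += int(pred == item["answer_index"])
    return correct / len(items)


if __name__ == "__main__":
    # Hypothetical data file whose records match the keys used above.
    with open("medblink_items.json") as f:
        items = json.load(f)

    # Trivial baseline: always pick the first option.
    always_first = lambda image_path, question, options: 0
    print(f"Baseline accuracy: {evaluate(items, always_first):.1%}")
```

Any real model would replace `always_first` with a function that sends the image and question to the MLM and maps its response back to an option index.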