MedQ-Bench: Evaluating and Exploring Medical Image Quality Assessment Abilities in MLLMs
October 2, 2025
Authors: Jiyao Liu, Jinjie Wei, Wanying Qu, Chenglong Ma, Junzhi Ning, Yunheng Li, Ying Chen, Xinzhe Luo, Pengcheng Chen, Xin Gao, Ming Hu, Huihui Xu, Xin Wang, Shujian Gao, Dingkang Yang, Zhongying Deng, Jin Ye, Lihao Liu, Junjun He, Ningsheng Xu
cs.AI
Abstract
Medical Image Quality Assessment (IQA) serves as the first-mile safety gate
for clinical AI, yet existing approaches remain constrained by scalar,
score-based metrics and fail to reflect the descriptive, human-like reasoning
process central to expert evaluation. To address this gap, we introduce
MedQ-Bench, a comprehensive benchmark that establishes a perception-reasoning
paradigm for language-based evaluation of medical image quality with
Multi-modal Large Language Models (MLLMs). MedQ-Bench defines two complementary
tasks: (1) MedQ-Perception, which probes low-level perceptual capability via
human-curated questions on fundamental visual attributes; and (2)
MedQ-Reasoning, encompassing both no-reference and comparison reasoning tasks,
aligning model evaluation with human-like reasoning on image quality. The
benchmark spans five imaging modalities and over forty quality attributes,
totaling 2,600 perceptual queries and 708 reasoning assessments, covering
diverse image sources including authentic clinical acquisitions, images with
simulated degradations via physics-based reconstructions, and AI-generated
images. To evaluate reasoning ability, we propose a multi-dimensional judging
protocol that assesses model outputs along four complementary axes. We further
conduct rigorous human-AI alignment validation by comparing LLM-based judgements
with those of radiologists. Our evaluation of 14 state-of-the-art MLLMs demonstrates
that models exhibit preliminary but unstable perceptual and reasoning skills,
with insufficient accuracy for reliable clinical use. These findings highlight
the need for targeted optimization of MLLMs in medical IQA. We hope that
MedQ-Bench will catalyze further exploration and unlock the untapped potential
of MLLMs for medical image quality evaluation.
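
The judging protocol is only summarized above: reasoning outputs are graded by an LLM-based judge along four complementary axes and then validated against radiologists. As a rough illustration of how such an LLM-as-judge loop could be wired up, the Python sketch below scores a candidate quality description against an expert reference on four placeholder axes. The axis names, the 0-2 rubric, the prompt wording, and the `judge_fn` hook are assumptions introduced here for illustration; they are not MedQ-Bench's actual protocol.

```python
"""Minimal sketch of a multi-axis LLM-as-judge scoring loop.

Not the MedQ-Bench protocol: the four axes, the 0-2 rubric, and the
prompt text below are illustrative placeholders, since the abstract
only states that outputs are judged along four complementary axes.
"""
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical axis names; the paper defines its own four dimensions.
AXES = ["completeness", "accuracy", "consistency", "quality_conclusion"]

JUDGE_PROMPT = (
    "You are grading a model-written medical image quality report.\n"
    "Reference (expert) assessment:\n{reference}\n\n"
    "Candidate assessment:\n{candidate}\n\n"
    "Score the candidate on the axis '{axis}' with an integer 0, 1, or 2 "
    "(0 = missing/wrong, 1 = partially correct, 2 = fully correct). "
    "Reply with the integer only."
)


@dataclass
class ReasoningSample:
    reference: str   # expert-written quality description
    candidate: str   # MLLM-generated quality description


def judge_sample(sample: ReasoningSample,
                 judge_fn: Callable[[str], str]) -> Dict[str, int]:
    """Query the judge model once per axis and parse an integer score."""
    scores: Dict[str, int] = {}
    for axis in AXES:
        prompt = JUDGE_PROMPT.format(
            reference=sample.reference, candidate=sample.candidate, axis=axis
        )
        reply = judge_fn(prompt)
        digits = [c for c in reply if c.isdigit()]
        scores[axis] = min(int(digits[0]), 2) if digits else 0
    return scores


def aggregate(per_sample: List[Dict[str, int]]) -> Dict[str, float]:
    """Average each axis over the benchmark, normalized to [0, 1]."""
    return {axis: mean(s[axis] for s in per_sample) / 2.0 for axis in AXES}


if __name__ == "__main__":
    # Stand-in judge that always answers "1"; replace with a real LLM call.
    demo = ReasoningSample(reference="Severe motion artifact obscures ...",
                           candidate="The image shows mild blurring ...")
    print(judge_sample(demo, judge_fn=lambda prompt: "1"))
```

In practice `judge_fn` would wrap a call to whatever judge model is used, and the per-axis averages from `aggregate` would be reported alongside the human-AI agreement check against radiologists described in the abstract.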