
MedQ-Bench: Evaluating and Exploring Medical Image Quality Assessment Abilities in MLLMs

October 2, 2025
Authors: Jiyao Liu, Jinjie Wei, Wanying Qu, Chenglong Ma, Junzhi Ning, Yunheng Li, Ying Chen, Xinzhe Luo, Pengcheng Chen, Xin Gao, Ming Hu, Huihui Xu, Xin Wang, Shujian Gao, Dingkang Yang, Zhongying Deng, Jin Ye, Lihao Liu, Junjun He, Ningsheng Xu
cs.AI

Abstract

Medical Image Quality Assessment (IQA) serves as the first-mile safety gate for clinical AI, yet existing approaches remain constrained by scalar, score-based metrics and fail to reflect the descriptive, human-like reasoning process central to expert evaluation. To address this gap, we introduce MedQ-Bench, a comprehensive benchmark that establishes a perception-reasoning paradigm for language-based evaluation of medical image quality with Multi-modal Large Language Models (MLLMs). MedQ-Bench defines two complementary tasks: (1) MedQ-Perception, which probes low-level perceptual capability via human-curated questions on fundamental visual attributes; and (2) MedQ-Reasoning, encompassing both no-reference and comparison reasoning tasks, aligning model evaluation with human-like reasoning on image quality. The benchmark spans five imaging modalities and over forty quality attributes, totaling 2,600 perceptual queries and 708 reasoning assessments, covering diverse image sources including authentic clinical acquisitions, images with simulated degradations via physics-based reconstructions, and AI-generated images. To evaluate reasoning ability, we propose a multi-dimensional judging protocol that assesses model outputs along four complementary axes. We further conduct rigorous human-AI alignment validation by comparing LLM-based judgements with those of radiologists. Our evaluation of 14 state-of-the-art MLLMs demonstrates that models exhibit preliminary but unstable perceptual and reasoning skills, with insufficient accuracy for reliable clinical use. These findings highlight the need for targeted optimization of MLLMs in medical IQA. We hope that MedQ-Bench will catalyze further exploration and unlock the untapped potential of MLLMs for medical image quality evaluation.
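The abstract does not spell out the judging protocol's details, but as a rough illustration of how per-axis judge scores could be collected and aggregated over such a benchmark, here is a minimal Python sketch. The axis names, the 0-2 score range, and the item IDs are placeholders assumed for illustration, not taken from the paper.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical axis names: the paper judges outputs along four complementary
# axes but this excerpt does not name them, so these are placeholders.
AXES = ("completeness", "accuracy", "consistency", "reasoning_quality")

@dataclass
class JudgedOutput:
    """One model response to a MedQ-Reasoning item, with a per-axis score
    (assumed 0-2 here) assigned by an LLM judge or a radiologist."""
    item_id: str
    scores: dict[str, int]

def aggregate(judged: list[JudgedOutput]) -> dict[str, float]:
    """Average each axis across the benchmark, plus an overall mean."""
    per_axis = {ax: mean(j.scores[ax] for j in judged) for ax in AXES}
    per_axis["overall"] = mean(per_axis.values())
    return per_axis

if __name__ == "__main__":
    # Placeholder items standing in for judged benchmark responses.
    demo = [
        JudgedOutput("ct_0001", {"completeness": 2, "accuracy": 1,
                                 "consistency": 2, "reasoning_quality": 1}),
        JudgedOutput("mri_0042", {"completeness": 1, "accuracy": 1,
                                  "consistency": 1, "reasoning_quality": 0}),
    ]
    print(aggregate(demo))
```

Keeping the four axes separate until the final aggregation step mirrors the paper's stated goal of multi-dimensional judgement: a single scalar would hide exactly the descriptive failure modes the benchmark is designed to surface.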