MedQ-Bench: Evaluating and Exploring Medical Image Quality Assessment Abilities in MLLMs

Abstract

Medical Image Quality Assessment (IQA) serves as the first-mile safety gate for clinical AI, yet existing approaches remain constrained to scalar, score-based metrics and fail to reflect the descriptive, human-like reasoning process central to expert evaluation. To address this gap, we introduce MedQ-Bench, a comprehensive benchmark that establishes a perception-reasoning paradigm for language-based evaluation of medical image quality with Multi-modal Large Language Models (MLLMs). MedQ-Bench defines two complementary tasks: (1) MedQ-Perception, which probes low-level perceptual capability via human-curated questions on fundamental visual attributes; and (2) MedQ-Reasoning, which encompasses both no-reference and comparison reasoning tasks and aligns model evaluation with human-like reasoning about image quality. The benchmark spans five imaging modalities and over forty quality attributes, totaling 2,600 perceptual queries and 708 reasoning assessments. It covers diverse image sources, including authentic clinical acquisitions, images with degradations simulated via physics-based reconstructions, and AI-generated images. To evaluate reasoning ability, we propose a multi-dimensional judging protocol that assesses model outputs along four complementary axes. We further conduct rigorous human-AI alignment validation by comparing LLM-based judgments with those of radiologists. Our evaluation of 14 state-of-the-art MLLMs shows that the models exhibit preliminary but unstable perceptual and reasoning skills, with accuracy insufficient for reliable clinical use. These findings highlight the need for targeted optimization of MLLMs for medical IQA. We hope that MedQ-Bench will catalyze further exploration and unlock the untapped potential of MLLMs for medical image quality evaluation.
