MedQ-Bench: Evaluating and Exploring Medical Image Quality Assessment Abilities in MLLMs

Abstract

Medical image quality assessment (IQA) serves as the first-mile safety gate for clinical AI, yet existing approaches remain constrained by scalar, score-based metrics and fail to reflect the descriptive, human-like reasoning process central to expert evaluation. To address this gap, we introduce MedQ-Bench, a comprehensive benchmark that establishes a perception-reasoning paradigm for language-based evaluation of medical image quality with multimodal large language models (MLLMs). MedQ-Bench defines two complementary tasks: (1) MedQ-Perception, which probes low-level perceptual capability via human-curated questions on fundamental visual attributes; and (2) MedQ-Reasoning, which encompasses both no-reference and comparison reasoning tasks and aligns model evaluation with human-like reasoning about image quality. The benchmark spans five imaging modalities and more than forty quality attributes, totaling 2,600 perceptual queries and 708 reasoning assessments. Its images come from diverse sources: authentic clinical acquisitions, images with degradations simulated via physics-based reconstructions, and AI-generated images. To evaluate reasoning ability, we propose a multi-dimensional judging protocol that assesses model outputs along four complementary axes. We further conduct rigorous human-AI alignment validation by comparing LLM-based judgments with those of radiologists. Our evaluation of 14 state-of-the-art MLLMs shows that the models exhibit preliminary but unstable perceptual and reasoning skills, with accuracy insufficient for reliable clinical use. These findings highlight the need for targeted optimization of MLLMs for medical IQA. We hope that MedQ-Bench will catalyze further exploration and unlock the untapped potential of MLLMs for medical image quality evaluation.

