NOVA: A Benchmark for Anomaly Localization and Clinical Reasoning in Brain MRI

May 20, 2025
Authors: Cosmin I. Bercea, Jun Li, Philipp Raffler, Evamaria O. Riedel, Lena Schmitzer, Angela Kurz, Felix Bitzer, Paula Roßmüller, Julian Canisius, Mirjam L. Beyrle, Che Liu, Wenjia Bai, Bernhard Kainz, Julia A. Schnabel, Benedikt Wiestler
cs.AI

Abstract

In many real-world applications, deployed models encounter inputs that differ from the data seen during training. Out-of-distribution detection identifies whether an input stems from an unseen distribution, while open-world recognition flags such inputs to ensure the system remains robust as ever-emerging, previously unknown categories appear and must be addressed without retraining. Foundation and vision-language models are pre-trained on large and diverse datasets with the expectation of broad generalization across domains, including medical imaging. However, benchmarking these models on test sets with only a few common outlier types silently collapses the evaluation back to a closed-set problem, masking failures on rare or truly novel conditions encountered in clinical use. We therefore present NOVA, a challenging, real-life evaluation-only benchmark of approximately 900 brain MRI scans that span 281 rare pathologies and heterogeneous acquisition protocols. Each case includes rich clinical narratives and double-blinded expert bounding-box annotations. Together, these enable joint assessment of anomaly localisation, visual captioning, and diagnostic reasoning. Because NOVA is never used for training, it serves as an extreme stress-test of out-of-distribution generalisation: models must bridge a distribution gap both in sample appearance and in semantic space. Baseline results with leading vision-language models (GPT-4o, Gemini 2.0 Flash, and Qwen2.5-VL-72B) reveal substantial performance drops across all tasks, establishing NOVA as a rigorous testbed for advancing models that can detect, localize, and reason about truly unknown anomalies.
