NOVA：脑部MRI异常定位与临床推理基准测试

摘要

在许多实际应用中，部署的模型会遇到与训练数据不同的输入。分布外检测旨在识别输入是否来自未见过的分布，而开放世界识别则标记此类输入，以确保系统在面对不断涌现的未知类别时保持鲁棒性，且无需重新训练。基础和视觉语言模型在大型多样化数据集上进行预训练，期望能在包括医学影像在内的多个领域实现广泛泛化。然而，在仅包含少数常见异常类型的测试集上对这些模型进行基准测试，会悄然将评估退化为封闭集问题，掩盖了在临床使用中遇到的罕见或真正新颖情况下的失败。因此，我们提出了NOVA，这是一个极具挑战性的、仅用于评估的现实生活基准，包含900个模拟脑部MRI扫描，涵盖281种罕见病理和异构采集协议。每个病例都包含丰富的临床叙述和双盲专家标注的边界框。这些共同支持对异常定位、视觉描述和诊断推理的联合评估。由于NOVA从未用于训练，它作为分布外泛化的极端压力测试：模型必须在样本外观和语义空间上跨越分布差距。使用领先的视觉语言模型（GPT-4o、Gemini 2.0 Flash和Qwen2.5-VL-72B）的基线结果显示，在所有任务中性能均大幅下降，确立了NOVA作为推动模型检测、定位和推理真正未知异常的严格测试平台。

English

In many real-world applications, deployed models encounter inputs that differ from the data seen during training. Out-of-distribution detection identifies whether an input stems from an unseen distribution, while open-world recognition flags such inputs to ensure the system remains robust as ever-emerging, previously unknown categories appear and must be addressed without retraining. Foundation and vision-language models are pre-trained on large and diverse datasets with the expectation of broad generalization across domains, including medical imaging. However, benchmarking these models on test sets with only a few common outlier types silently collapses the evaluation back to a closed-set problem, masking failures on rare or truly novel conditions encountered in clinical use. We therefore present NOVA, a challenging, real-life evaluation-only benchmark of sim900 brain MRI scans that span 281 rare pathologies and heterogeneous acquisition protocols. Each case includes rich clinical narratives and double-blinded expert bounding-box annotations. Together, these enable joint assessment of anomaly localisation, visual captioning, and diagnostic reasoning. Because NOVA is never used for training, it serves as an extreme stress-test of out-of-distribution generalisation: models must bridge a distribution gap both in sample appearance and in semantic space. Baseline results with leading vision-language models (GPT-4o, Gemini 2.0 Flash, and Qwen2.5-VL-72B) reveal substantial performance drops across all tasks, establishing NOVA as a rigorous testbed for advancing models that can detect, localize, and reason about truly unknown anomalies.