NOVA: A Benchmark for Anomaly Localization and Clinical Reasoning in Brain MRI

Abstract

In many real-world applications, deployed models encounter inputs that differ from the data seen during training. Out-of-distribution detection identifies whether an input stems from an unseen distribution, while open-world recognition flags such inputs so that the system remains robust as ever-emerging, previously unknown categories appear and must be handled without retraining. Foundation and vision-language models are pre-trained on large and diverse datasets with the expectation of broad generalization across domains, including medical imaging. However, benchmarking these models on test sets with only a few common outlier types silently collapses the evaluation back to a closed-set problem, masking failures on rare or truly novel conditions encountered in clinical use. We therefore present NOVA, a challenging, real-life, evaluation-only benchmark of approximately 900 brain MRI scans that span 281 rare pathologies and heterogeneous acquisition protocols. Each case includes rich clinical narratives and double-blinded expert bounding-box annotations. Together, these enable joint assessment of anomaly localization, visual captioning, and diagnostic reasoning. Because NOVA is never used for training, it serves as an extreme stress test of out-of-distribution generalization: models must bridge a distribution gap both in sample appearance and in semantic space. Baseline results with leading vision-language models (GPT-4o, Gemini 2.0 Flash, and Qwen2.5-VL-72B) reveal substantial performance drops across all tasks, establishing NOVA as a rigorous testbed for advancing models that can detect, localize, and reason about truly unknown anomalies.