
NOVA: A Benchmark for Anomaly Localization and Clinical Reasoning in Brain MRI

May 20, 2025
Authors: Cosmin I. Bercea, Jun Li, Philipp Raffler, Evamaria O. Riedel, Lena Schmitzer, Angela Kurz, Felix Bitzer, Paula Roßmüller, Julian Canisius, Mirjam L. Beyrle, Che Liu, Wenjia Bai, Bernhard Kainz, Julia A. Schnabel, Benedikt Wiestler
cs.AI

Abstract

In many real-world applications, deployed models encounter inputs that differ from the data seen during training. Out-of-distribution detection identifies whether an input stems from an unseen distribution, while open-world recognition flags such inputs to ensure the system remains robust as ever-emerging, previously unknown categories appear and must be addressed without retraining. Foundation and vision-language models are pre-trained on large and diverse datasets with the expectation of broad generalization across domains, including medical imaging. However, benchmarking these models on test sets with only a few common outlier types silently collapses the evaluation back to a closed-set problem, masking failures on rare or truly novel conditions encountered in clinical use. We therefore present NOVA, a challenging, real-life evaluation-only benchmark of ~900 brain MRI scans that span 281 rare pathologies and heterogeneous acquisition protocols. Each case includes rich clinical narratives and double-blinded expert bounding-box annotations. Together, these enable joint assessment of anomaly localisation, visual captioning, and diagnostic reasoning. Because NOVA is never used for training, it serves as an extreme stress-test of out-of-distribution generalisation: models must bridge a distribution gap both in sample appearance and in semantic space. Baseline results with leading vision-language models (GPT-4o, Gemini 2.0 Flash, and Qwen2.5-VL-72B) reveal substantial performance drops across all tasks, establishing NOVA as a rigorous testbed for advancing models that can detect, localize, and reason about truly unknown anomalies.
