NOVA: A Benchmark for Anomaly Localization and Clinical Reasoning in Brain MRI

Abstract

In many real-world applications, deployed models encounter inputs that differ from the data seen during training. Out-of-distribution detection identifies whether an input stems from an unseen distribution, while open-world recognition flags such inputs so that the system remains robust as ever-emerging, previously unknown categories appear and must be addressed without retraining. Foundation and vision-language models are pre-trained on large and diverse datasets with the expectation of broad generalization across domains, including medical imaging. However, benchmarking these models on test sets with only a few common outlier types silently collapses the evaluation back to a closed-set problem, masking failures on rare or truly novel conditions encountered in clinical use. We therefore present NOVA, a challenging, real-life, evaluation-only benchmark of ~900 brain MRI scans that span 281 rare pathologies and heterogeneous acquisition protocols. Each case includes rich clinical narratives and double-blinded expert bounding-box annotations. Together, these enable joint assessment of anomaly localization, visual captioning, and diagnostic reasoning. Because NOVA is never used for training, it serves as an extreme stress test of out-of-distribution generalization: models must bridge a distribution gap both in sample appearance and in semantic space. Baseline results with leading vision-language models (GPT-4o, Gemini 2.0 Flash, and Qwen2.5-VL-72B) reveal substantial performance drops across all tasks, establishing NOVA as a rigorous testbed for advancing models that can detect, localize, and reason about truly unknown anomalies.