NOVA: A Benchmark for Anomaly Localization and Clinical Reasoning in Brain MRI

Abstract

In many real-world applications, deployed models encounter inputs that differ from the data seen during training. Out-of-distribution detection identifies whether an input stems from an unseen distribution, while open-world recognition flags such inputs so that the system remains robust as previously unknown categories emerge and must be handled without retraining. Foundation and vision-language models are pre-trained on large and diverse datasets with the expectation of broad generalization across domains, including medical imaging. However, benchmarking these models on test sets with only a few common outlier types silently collapses the evaluation back to a closed-set problem, masking failures on rare or truly novel conditions encountered in clinical use. We therefore present NOVA, a challenging, real-life, evaluation-only benchmark of approximately 900 brain MRI scans that span 281 rare pathologies and heterogeneous acquisition protocols. Each case includes rich clinical narratives and double-blinded expert bounding-box annotations. Together, these enable joint assessment of anomaly localization, visual captioning, and diagnostic reasoning. Because NOVA is never used for training, it serves as an extreme stress test of out-of-distribution generalization: models must bridge a distribution gap in both sample appearance and semantic space. Baseline results with leading vision-language models (GPT-4o, Gemini 2.0 Flash, and Qwen2.5-VL-72B) reveal substantial performance drops across all tasks, establishing NOVA as a rigorous testbed for advancing models that can detect, localize, and reason about truly unknown anomalies.
