MedVista3D: Vision-Language Modeling for Reducing Diagnostic Errors in 3D CT Disease Detection, Understanding, and Reporting

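Abstract

Errors in radiologic diagnosis, including under-reading errors, inattentional blindness, and communication failures, still occur frequently in clinical practice. These problems often arise from missed localized abnormalities, a limited grasp of the global picture, and variability in report language. The challenges are more pronounced in 3D imaging, where clinicians must review hundreds of slices in a single scan. Addressing them requires systems that provide precise localized detection, reasoning at the level of the whole volume, and semantically consistent natural language reports. However, existing 3D vision-language models cannot satisfy these three requirements at once: they lack the local-global understanding needed for spatial reasoning, and they cannot cope with the variability and noise of uncurated radiology reports. This paper proposes MedVista3D, a multi-scale, semantically enriched vision-language pretraining framework for 3D CT analysis. To perform disease detection and holistic interpretation jointly, MedVista3D carries out local and global image-text alignment, learning fine-grained representations within the context of the full volume. To handle report variability, it applies language model rewriting and introduces a Radiology Semantic Matching Bank for semantics-aware alignment. MedVista3D achieves state-of-the-art performance on zero-shot disease classification, report retrieval, and medical visual question answering, and it also transfers well to organ segmentation and prognosis prediction. The code and datasets are planned for public release.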

English

Radiologic diagnostic errors (under-reading errors, inattentional blindness, and communication failures) remain prevalent in clinical practice. These issues often stem from missed localized abnormalities, limited global context, and variability in report language. These challenges are amplified in 3D imaging, where clinicians must examine hundreds of slices per scan. Addressing them requires systems with precise localized detection, global volume-level reasoning, and semantically consistent natural language reporting. However, existing 3D vision-language models are unable to meet all three needs jointly, lacking local-global understanding for spatial reasoning and struggling with the variability and noise of uncurated radiology reports. We present MedVista3D, a multi-scale semantic-enriched vision-language pretraining framework for 3D CT analysis. To enable joint disease detection and holistic interpretation, MedVista3D performs local and global image-text alignment for fine-grained representation learning within full-volume context. To address report variability, we apply language model rewrites and introduce a Radiology Semantic Matching Bank for semantics-aware alignment. MedVista3D achieves state-of-the-art performance on zero-shot disease classification, report retrieval, and medical visual question answering, while transferring well to organ segmentation and prognosis prediction. Code and datasets will be released.
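
The abstract does not specify the objective behind "local and global image-text alignment". As a rough illustration only, the sketch below assumes a CLIP-style symmetric contrastive (InfoNCE) loss applied at two scales: once to whole-volume embeddings paired with full reports, and once to region-level embeddings paired with region-level text. The function names, the temperature value, and the `local_weight` parameter are hypothetical and are not taken from MedVista3D; the Radiology Semantic Matching Bank is not modeled here.

```python
import math


def normalize(v):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row maximum for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched image/text embeddings.

    Row i of image_embs is assumed to be paired with row i of text_embs.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    # Cosine similarity between every image and every text, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


def multi_scale_loss(global_img, global_txt, local_img, local_txt, local_weight=1.0):
    """Hypothetical combination: global (volume/report) plus local (region/text) terms."""
    return (
        contrastive_loss(global_img, global_txt)
        + local_weight * contrastive_loss(local_img, local_txt)
    )
```

Under these assumptions, correctly paired embeddings yield a loss near zero, while shuffled pairs yield a large loss, which is the behavior that pushes matching volumes and reports together in the shared embedding space.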