MedVista3D: Vision-Language Modeling for Reducing Diagnostic Errors in 3D CT Disease Detection, Understanding and Reporting
September 4, 2025
Authors: Yuheng Li, Yenho Chen, Yuxiang Lai, Jike Zhong, Vanessa Wildman, Xiaofeng Yang
cs.AI
Abstract
Radiologic diagnostic errors, including under-reading errors, inattentional
blindness, and communication failures, remain prevalent in clinical practice. These issues
often stem from missed localized abnormalities, limited global context, and
variability in report language. These challenges are amplified in 3D imaging,
where clinicians must examine hundreds of slices per scan. Addressing them
requires systems with precise localized detection, global volume-level
reasoning, and semantically consistent natural language reporting. However,
existing 3D vision-language models are unable to meet all three needs jointly,
lacking local-global understanding for spatial reasoning and struggling with
the variability and noise of uncurated radiology reports. We present
MedVista3D, a multi-scale semantic-enriched vision-language pretraining
framework for 3D CT analysis. To enable joint disease detection and holistic
interpretation, MedVista3D performs local and global image-text alignment for
fine-grained representation learning within full-volume context. To address
report variability, we apply language model rewrites and introduce a Radiology
Semantic Matching Bank for semantics-aware alignment. MedVista3D achieves
state-of-the-art performance on zero-shot disease classification, report
retrieval, and medical visual question answering, while transferring well to
organ segmentation and prognosis prediction. Code and datasets will be
released.
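The core pretraining signal described above is contrastive image-text alignment at both the global (volume-report) and local (region-sentence) scales. The sketch below illustrates the general idea with a symmetric InfoNCE loss combined across the two scales; it is a minimal NumPy illustration under assumed conventions, not the paper's implementation, and the function names (`info_nce`, `multiscale_alignment_loss`) and the `local_weight` value are illustrative assumptions.

```python
import numpy as np

def info_nce(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired (image, text) embeddings.

    Row i of img_emb is assumed to be paired with row i of txt_emb;
    all other rows in the batch serve as negatives.
    """
    # Cosine similarity: L2-normalize, then take the dot-product matrix.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature

    def cross_entropy_diag(l):
        # Numerically stable log-softmax; the matching pair sits on the diagonal.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))

def multiscale_alignment_loss(g_img, g_txt, l_img, l_txt, local_weight=0.5):
    """Global volume-report term plus a weighted local region-sentence term
    (the weighting scheme here is an assumption for illustration)."""
    return info_nce(g_img, g_txt) + local_weight * info_nce(l_img, l_txt)
```

Correctly matched pairs should yield a lower loss than mismatched ones, which is the property the alignment objective exploits during pretraining.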