MedVista3D: Vision-Language Modeling for Reducing Diagnostic Errors in 3D CT Disease Detection, Understanding and Reporting
September 4, 2025
Authors: Yuheng Li, Yenho Chen, Yuxiang Lai, Jike Zhong, Vanessa Wildman, Xiaofeng Yang
cs.AI
Abstract
Radiologic diagnostic errors, including under-reading errors, inattentional
blindness, and communication failures, remain prevalent in clinical practice. These issues
often stem from missed localized abnormalities, limited global context, and
variability in report language. These challenges are amplified in 3D imaging,
where clinicians must examine hundreds of slices per scan. Addressing them
requires systems with precise localized detection, global volume-level
reasoning, and semantically consistent natural language reporting. However,
existing 3D vision-language models are unable to meet all three needs jointly,
lacking local-global understanding for spatial reasoning and struggling with
the variability and noise of uncurated radiology reports. We present
MedVista3D, a multi-scale semantic-enriched vision-language pretraining
framework for 3D CT analysis. To enable joint disease detection and holistic
interpretation, MedVista3D performs local and global image-text alignment for
fine-grained representation learning within full-volume context. To address
report variability, we apply language model rewrites and introduce a Radiology
Semantic Matching Bank for semantics-aware alignment. MedVista3D achieves
state-of-the-art performance on zero-shot disease classification, report
retrieval, and medical visual question answering, while transferring well to
organ segmentation and prognosis prediction. Code and datasets will be
released.
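The core pretraining signal described above is contrastive image-text alignment at both the global (volume-report) and local (region-sentence) scales. The sketch below illustrates the general idea with a symmetric InfoNCE loss combined across the two scales; it is a minimal NumPy illustration under assumed conventions, not the paper's implementation, and the function names (`info_nce`, `multiscale_alignment_loss`) and the `local_weight` value are illustrative assumptions.

```python
import numpy as np

def info_nce(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired (image, text) embeddings.

    Row i of img_emb is assumed to be paired with row i of txt_emb;
    all other rows in the batch serve as negatives.
    """
    # Cosine similarity: L2-normalize, then take the dot-product matrix.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature

    def cross_entropy_diag(l):
        # Numerically stable log-softmax; the matching pair sits on the diagonal.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))

def multiscale_alignment_loss(g_img, g_txt, l_img, l_txt, local_weight=0.5):
    """Global volume-report term plus a weighted local region-sentence term
    (the weighting scheme here is an assumption for illustration)."""
    return info_nce(g_img, g_txt) + local_weight * info_nce(l_img, l_txt)
```

Correctly matched pairs should yield a lower loss than mismatched ones, which is the property the alignment objective exploits during pretraining.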