MedVista3D: Vision-Language Modeling for Reducing Diagnostic Errors in 3D CT Disease Detection, Understanding and Reporting

Abstract

Radiologic diagnostic errors (under-reading errors, inattentional blindness, and communication failures) remain prevalent in clinical practice. These errors often stem from missed localized abnormalities, limited global context, and variability in report language. The challenges are amplified in 3D imaging, where clinicians must examine hundreds of slices per scan. Addressing them requires systems that combine precise localized detection, global volume-level reasoning, and semantically consistent natural language reporting. However, existing 3D vision-language models cannot meet all three needs jointly: they lack the local-global understanding required for spatial reasoning, and they struggle with the variability and noise of uncurated radiology reports. We present MedVista3D, a multi-scale, semantics-enriched vision-language pretraining framework for 3D CT analysis. To enable joint disease detection and holistic interpretation, MedVista3D performs local and global image-text alignment for fine-grained representation learning within full-volume context. To address report variability, we apply language model rewrites and introduce a Radiology Semantic Matching Bank for semantics-aware alignment. MedVista3D achieves state-of-the-art performance on zero-shot disease classification, report retrieval, and medical visual question answering, while transferring well to organ segmentation and prognosis prediction. Code and datasets will be released.
