Towards Scalable Language-Image Pre-training for 3D Medical Imaging

Abstract

Language-image pre-training has demonstrated strong performance in 2D medical imaging, but its success in 3D modalities such as CT and MRI remains limited: the high computational demands of volumetric data pose a significant barrier to training on large-scale, uncurated clinical studies. In this study, we introduce Hierarchical attention for Language-Image Pre-training (HLIP), a scalable pre-training framework for 3D medical imaging. HLIP adopts a lightweight hierarchical attention mechanism inspired by the natural hierarchy of radiology data: slice, scan, and study. This mechanism exhibits strong generalizability, e.g., +4.3% macro AUC on the Rad-ChestCT benchmark when pre-trained on CT-RATE. Moreover, the computational efficiency of HLIP enables direct training on uncurated datasets. Trained on 220K patients with 3.13 million scans for brain MRI and 240K patients with 1.44 million scans for head CT, HLIP achieves state-of-the-art performance, e.g., +32.4% balanced accuracy on the proposed publicly available brain MRI benchmark Pub-Brain-5, and +1.4% and +6.9% macro AUC on the head CT benchmarks RSNA and CQ500, respectively. These results demonstrate that, with HLIP, directly pre-training on uncurated clinical datasets is a scalable and effective direction for language-image pre-training in 3D medical imaging. The code is available at https://github.com/Zch0414/hlip.


