EchoVLM: Dynamic Mixture-of-Experts Vision-Language Model for Universal Ultrasound Intelligence

Abstract

Ultrasound imaging has become the preferred modality for early cancer screening because it involves no ionizing radiation, is low cost, and images in real time. However, conventional ultrasound diagnosis relies heavily on physician expertise, which makes it highly subjective and limits diagnostic efficiency. Vision-language models (VLMs) offer a promising solution, but existing general-purpose models have limited knowledge of ultrasound medical tasks, generalize poorly in multi-organ lesion recognition, and are inefficient across multi-task diagnostics. To address these limitations, we propose EchoVLM, a vision-language model designed specifically for ultrasound medical imaging. The model employs a Mixture of Experts (MoE) architecture trained on data spanning seven anatomical regions. This design enables the model to perform multiple tasks, including ultrasound report generation, diagnosis, and visual question answering (VQA). On the ultrasound report generation task, EchoVLM improved on Qwen2-VL by 10.15 points in BLEU-1 and 4.77 points in ROUGE-1. These findings suggest that EchoVLM has substantial potential to enhance diagnostic accuracy in ultrasound imaging and could provide a viable technical solution for future clinical applications. Source code and model weights are available at https://github.com/Asunatan/EchoVLM.
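For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how unigram-overlap scores of this kind are typically computed for a single candidate report against a single reference. It is not the authors' evaluation code: the whitespace tokenization, the single-reference setup, the choice of the F1 variant of ROUGE-1, the 0 to 100 scaling, and the example sentences are all assumptions made for illustration.

```python
import math
from collections import Counter


def unigram_overlap(candidate, reference):
    """Count clipped unigram matches between two token lists."""
    cand_counts = Counter(candidate)
    ref_counts = Counter(reference)
    # A candidate token is credited at most as often as it occurs in the reference.
    return sum(min(count, ref_counts[token]) for token, count in cand_counts.items())


def bleu1(candidate, reference):
    """Single-reference BLEU-1: clipped unigram precision times the brevity penalty."""
    if not candidate:
        return 0.0
    precision = unigram_overlap(candidate, reference) / len(candidate)
    # The brevity penalty discounts candidates shorter than the reference.
    if len(candidate) >= len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1.0 - len(reference) / len(candidate))
    return precision * brevity_penalty


def rouge1_f1(candidate, reference):
    """ROUGE-1 as the harmonic mean of unigram precision and recall (assumed variant)."""
    if not candidate or not reference:
        return 0.0
    overlap = unigram_overlap(candidate, reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


# Hypothetical report sentences, tokenized on whitespace for illustration only.
reference = "hypoechoic nodule in the left thyroid lobe".split()
candidate = "hypoechoic nodule in the thyroid".split()

print(round(100 * bleu1(candidate, reference), 2))      # 67.03
print(round(100 * rouge1_f1(candidate, reference), 2))  # 83.33
```

Under these assumptions, a gain of 10.15 points in BLEU-1 would correspond to a 10.15 increase on the 0 to 100 scale printed above, averaged over a test set rather than measured on one sentence pair.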
