SilVar-Med: A Speech-Driven Visual Language Model for Explainable Abnormality Detection in Medical Imaging

Abstract

Medical visual language models (VLMs) have shown great potential in various healthcare applications, including medical image captioning and diagnostic assistance. However, most existing models rely on text-based instructions, which limits their usability in real-world clinical environments, especially in scenarios such as surgery, where text-based interaction is often impractical for physicians. In addition, current medical image analysis models typically lack comprehensive reasoning behind their predictions, which reduces their reliability for clinical decision-making. Given that medical diagnosis errors can have life-changing consequences, there is a critical need for interpretable and rational medical assistance. To address these challenges, we introduce SilVar-Med, an end-to-end speech-driven medical VLM. It is a multimodal medical image assistant that integrates speech interaction with VLMs, pioneering the task of voice-based communication for medical image analysis. In addition, we focus on interpreting the reasoning behind each prediction of medical abnormalities, using a proposed reasoning dataset. Through extensive experiments, we present a proof-of-concept study of reasoning-driven medical image interpretation with end-to-end speech interaction. We believe this work will advance the field of medical AI by fostering more transparent, interactive, and clinically viable diagnostic support systems. Our code and dataset are publicly available at SiVar-Med.
