SilVar-Med: A Speech-Driven Visual Language Model for Explainable Abnormality Detection in Medical Imaging

Abstract

Medical visual language models (VLMs) have shown great potential in various healthcare applications, including medical image captioning and diagnostic assistance. However, most existing models rely on text-based instructions, which limits their usability in real-world clinical environments, especially in scenarios such as surgery, where text-based interaction is often impractical for physicians. In addition, current medical image analysis models typically lack comprehensive reasoning behind their predictions, which reduces their reliability for clinical decision-making. Given that errors in medical diagnosis can have life-changing consequences, there is a critical need for interpretable and rational medical assistance. To address these challenges, we introduce SilVar-Med, an end-to-end speech-driven medical VLM: a multimodal medical image assistant that integrates speech interaction with VLMs, pioneering the task of voice-based communication for medical image analysis. We also focus on interpreting the reasoning behind each prediction of a medical abnormality, using a proposed reasoning dataset. Through extensive experiments, we present a proof-of-concept study of reasoning-driven medical image interpretation with end-to-end speech interaction. We believe this work will advance the field of medical AI by fostering more transparent, interactive, and clinically viable diagnostic support systems. Our code and dataset are publicly available at SilVar-Med.

