Uncertainty-Aware Vision-Language Segmentation for Medical Imaging

Abstract

We introduce a novel uncertainty-aware multimodal segmentation framework that leverages both radiological images and associated clinical text for precise medical diagnosis. To enable efficient cross-modal fusion and long-range dependency modelling, we propose a Modality Decoding Attention Block (MoDAB) with a lightweight State Space Mixer (SSMix). To guide learning under ambiguity, we propose the Spectral-Entropic Uncertainty (SEU) Loss, which jointly captures spatial overlap, spectral consistency, and predictive uncertainty in a unified objective. This formulation improves model reliability in complex clinical circumstances with poor image quality. Extensive experiments on three publicly available medical datasets (QATA-COVID19, MosMed++, and Kvasir-SEG) demonstrate that our method achieves superior segmentation performance while being significantly more computationally efficient than existing State-of-the-Art (SoTA) approaches. Our results highlight the importance of incorporating uncertainty modelling and structured modality alignment in vision-language medical segmentation tasks. Code: https://github.com/arya-domain/UA-VLS
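
The abstract names the three ingredients of the SEU Loss but does not give its formula. The sketch below is only an illustration of how such a combined objective could look; it is not the authors' implementation. It assumes a Dice term for spatial overlap, a mean absolute difference between Fourier magnitude spectra for spectral consistency, and the mean binary entropy of the prediction for predictive uncertainty. The function names and the weights `w_spec` and `w_ent` are hypothetical placeholders.

```python
import cmath
import math

EPS = 1e-7  # numerical guard; an assumption, not a value from the paper


def dice_loss(pred, target):
    """1 - Dice coefficient between a probability map and a binary mask."""
    inter = sum(p * t for rp, rt in zip(pred, target) for p, t in zip(rp, rt))
    total = sum(map(sum, pred)) + sum(map(sum, target))
    return 1.0 - (2.0 * inter + EPS) / (total + EPS)


def dft2_magnitude(img):
    """Magnitude of a naive 2D discrete Fourier transform (demo sizes only)."""
    h, w = len(img), len(img[0])
    out = []
    for u in range(h):
        row = []
        for v in range(w):
            acc = 0j
            for y in range(h):
                for x in range(w):
                    angle = -2j * math.pi * (u * y / h + v * x / w)
                    acc += img[y][x] * cmath.exp(angle)
            row.append(abs(acc))
        out.append(row)
    return out


def spectral_loss(pred, target):
    """Mean absolute difference between the two magnitude spectra."""
    mp, mt = dft2_magnitude(pred), dft2_magnitude(target)
    n = len(pred) * len(pred[0])
    return sum(abs(a - b) for ra, rb in zip(mp, mt) for a, b in zip(ra, rb)) / n


def entropy_term(pred):
    """Mean binary entropy (in nats) of the predicted foreground probabilities."""
    def binary_entropy(p):
        p = min(max(p, EPS), 1.0 - EPS)
        return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))
    n = len(pred) * len(pred[0])
    return sum(binary_entropy(p) for row in pred for p in row) / n


def combined_loss(pred, target, w_spec=0.1, w_ent=0.1):
    """Hypothetical weighted sum of the three terms; weights are placeholders."""
    return (dice_loss(pred, target)
            + w_spec * spectral_loss(pred, target)
            + w_ent * entropy_term(pred))


# Toy 2x2 example: a confident, correct prediction versus a maximally unsure one.
target = [[0.0, 1.0], [1.0, 0.0]]
confident = [[0.01, 0.99], [0.99, 0.01]]
uncertain = [[0.5, 0.5], [0.5, 0.5]]
loss_confident = combined_loss(confident, target)
loss_uncertain = combined_loss(uncertain, target)
```

In this toy setting the confident, correct prediction receives a lower value than the uniform 0.5 prediction, since all three terms penalise the latter. A practical version would use a tensor library with a fast Fourier transform rather than the naive transform shown here.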
