EchoVLM: Dynamic Mixture-of-Experts Vision-Language Model for Universal Ultrasound Intelligence

Abstract

Ultrasound imaging has become the preferred modality for early cancer screening because it uses no ionizing radiation, is low-cost, and supports real-time imaging. However, conventional ultrasound diagnosis relies heavily on physician expertise, which makes it highly subjective and limits diagnostic efficiency. Vision-language models (VLMs) offer a promising solution, but existing general-purpose models have limited knowledge of ultrasound medical tasks, generalize poorly in multi-organ lesion recognition, and are inefficient across multi-task diagnostics. To address these limitations, we propose EchoVLM, a vision-language model designed specifically for ultrasound medical imaging. The model employs a Mixture of Experts (MoE) architecture trained on data spanning seven anatomical regions. This design enables the model to perform multiple tasks, including ultrasound report generation, diagnosis, and visual question answering (VQA). On the ultrasound report generation task, EchoVLM achieved significant improvements over Qwen2-VL of 10.15 points in BLEU-1 and 4.77 points in ROUGE-1. These findings suggest that EchoVLM has substantial potential to improve diagnostic accuracy in ultrasound imaging and offers a viable technical solution for future clinical applications. Source code and model weights are available at https://github.com/Asunatan/EchoVLM.
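For readers unfamiliar with the two report-generation metrics named above, the following is a minimal sketch of how unigram-based BLEU-1 and ROUGE-1 can be computed for a single candidate/reference pair. It is not the evaluation code used for EchoVLM. The whitespace tokenization, the choice of the F1 variant of ROUGE-1, and the example sentences are all assumptions made for illustration; the abstract does not specify these details. If scores are reported on the common 0 to 100 scale, a gain of 10.15 points would correspond to a difference of 0.1015 in the values returned here.

```python
import math
from collections import Counter


def unigram_overlap(candidate: list[str], reference: list[str]) -> int:
    """Count matching unigrams, clipping each token by its count in the reference."""
    cand_counts = Counter(candidate)
    ref_counts = Counter(reference)
    return sum(min(count, ref_counts[token]) for token, count in cand_counts.items())


def bleu_1(candidate: list[str], reference: list[str]) -> float:
    """BLEU-1: clipped unigram precision multiplied by the brevity penalty."""
    if not candidate:
        return 0.0
    precision = unigram_overlap(candidate, reference) / len(candidate)
    # The brevity penalty discourages candidates shorter than the reference.
    if len(candidate) >= len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(candidate))
    return brevity_penalty * precision


def rouge_1_f1(candidate: list[str], reference: list[str]) -> float:
    """ROUGE-1 (F1 variant, an assumption): harmonic mean of unigram precision and recall."""
    overlap = unigram_overlap(candidate, reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


# Hypothetical report fragments, invented purely for illustration.
reference = "hypoechoic nodule in the left thyroid lobe".split()
candidate = "hypoechoic nodule in the thyroid".split()

print(f"BLEU-1:  {bleu_1(candidate, reference):.4f}")
print(f"ROUGE-1: {rouge_1_f1(candidate, reference):.4f}")
```

In this toy pair every candidate token appears in the reference, so unigram precision is 1.0; the BLEU-1 score is reduced only by the brevity penalty, and the ROUGE-1 score only by the two reference tokens the candidate omits.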
