MedVisionLlama: Leveraging Pre-Trained Large Language Model Layers to Enhance Medical Image Segmentation

Abstract

Large Language Models (LLMs), known for their versatility with textual data, are increasingly being investigated as a way to improve medical image segmentation, a task central to accurate diagnostic imaging. This study extends Vision Transformers (ViTs) for medical image segmentation by integrating pre-trained LLM transformer blocks. Our approach incorporates a frozen LLM transformer block into the encoder of a ViT-based model and yields substantial improvements in segmentation performance across various medical imaging modalities. We also propose a Hybrid Attention Mechanism that combines global and local feature learning, together with a Multi-Scale Fusion Block that aggregates features across different scales. The enhanced model raises the average Dice score from 0.74 to 0.79 and also improves accuracy, precision, and the Jaccard Index. These results indicate that LLM-based transformer blocks are effective for refining medical image segmentation and can improve both model accuracy and robustness. The source code and our implementation are available at: https://bit.ly/3zf2CVs
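
For readers unfamiliar with the two overlap metrics named in the abstract, the following is a minimal sketch of the standard Dice and Jaccard definitions for binary masks. It is not taken from the authors' implementation: the function name `dice_and_jaccard`, the flattened-list mask representation, and the convention of scoring two empty masks as 1.0 are all assumptions made for illustration.

```python
def dice_and_jaccard(pred, target):
    """Return (Dice, Jaccard) for two equal-length binary masks.

    Hypothetical helper: masks are flat sequences of 0/1 values,
    one entry per pixel (or voxel).
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of elements")

    # Count foreground pixels shared by both masks and in each mask.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_sum = sum(1 for p in pred if p)
    target_sum = sum(1 for t in target if t)
    union = pred_sum + target_sum - intersection

    if union == 0:
        # Assumed convention: two empty masks count as perfect agreement.
        return 1.0, 1.0

    dice = 2 * intersection / (pred_sum + target_sum)
    jaccard = intersection / union
    return dice, jaccard


# Toy 8-pixel example: 3 shared foreground pixels, 4 foreground pixels in each mask.
pred = [1, 1, 1, 0, 0, 0, 1, 0]
target = [1, 1, 0, 0, 0, 1, 1, 0]
dice, jaccard = dice_and_jaccard(pred, target)
print(dice, jaccard)
```

For a single pair of masks the two scores are tied by J = D / (2 - D), so they always rank predictions in the same order. That identity holds per mask and does not carry over exactly to scores averaged over a dataset.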