

Enhancing Abnormality Grounding for Vision Language Models with Knowledge Descriptions

March 5, 2025
作者: Jun Li, Che Liu, Wenjia Bai, Rossella Arcucci, Cosmin I. Bercea, Julia A. Schnabel
cs.AI

Abstract

Visual Language Models (VLMs) have demonstrated impressive capabilities in visual grounding tasks. However, their effectiveness in the medical domain, particularly for abnormality detection and localization within medical images, remains underexplored. A major challenge is the complex and abstract nature of medical terminology, which makes it difficult to directly associate pathological anomaly terms with their corresponding visual features. In this work, we introduce a novel approach to enhance VLM performance in medical abnormality detection and localization by leveraging decomposed medical knowledge. Instead of directly prompting models to recognize specific abnormalities, we focus on breaking down medical concepts into fundamental attributes and common visual patterns. This strategy promotes a stronger alignment between textual descriptions and visual features, improving both the recognition and localization of abnormalities in medical images. We evaluate our method on the 0.23B Florence-2 base model and demonstrate that it achieves comparable performance in abnormality grounding to significantly larger 7B LLaVA-based medical VLMs, despite being trained on only 1.5% of the data used for such models. Experimental results also demonstrate the effectiveness of our approach on both known and previously unseen abnormalities, suggesting strong generalization capabilities.
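The core idea, replacing a bare abnormality term with its decomposed attribute descriptions in the grounding prompt, can be illustrated with a minimal sketch. The attribute lists and prompt template below are hypothetical illustrations, not the paper's actual knowledge descriptions or prompt format:

```python
# Hypothetical knowledge base: each abnormality term is decomposed into
# fundamental visual attributes and common patterns (illustrative only).
KNOWLEDGE = {
    "pneumothorax": [
        "a dark, air-filled region without lung markings",
        "a thin visceral pleural line displaced from the chest wall",
    ],
    "pleural effusion": [
        "a homogeneous opacity layering at the lung base",
        "blunting of the costophrenic angle",
    ],
}

def build_grounding_prompt(term: str) -> str:
    """Build a grounding prompt from decomposed visual attributes instead
    of the bare medical term, so the text aligns with visual features."""
    attrs = KNOWLEDGE.get(term)
    if not attrs:
        # Unseen term: fall back to the plain grounding instruction.
        return f"Locate the region showing {term}."
    desc = "; ".join(attrs)
    return f"Locate the region showing {term}, which appears as: {desc}."

print(build_grounding_prompt("pneumothorax"))
```

In this sketch the decomposed description, rather than the raw term, is what the grounding model receives, which is the alignment strategy the abstract describes; the actual knowledge descriptions and model interface in the paper may differ.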
