Med-Flamingo: a Multimodal Medical Few-shot Learner

Abstract

Medicine, by its nature, is a multifaceted domain that requires synthesizing information across modalities. Medical generative vision-language models (VLMs) take a first step in this direction and promise many exciting clinical applications. However, existing models typically have to be fine-tuned on sizeable downstream datasets. This is a significant limitation, because data are scarce in many medical applications, which calls for models that can learn from a few examples in real time. Here we propose Med-Flamingo, a multimodal few-shot learner adapted to the medical domain. Starting from OpenFlamingo-9B, we continue pre-training on paired and interleaved medical image-text data from publications and textbooks. Med-Flamingo unlocks few-shot generative medical visual question answering (VQA) abilities, which we evaluate on several datasets, including a novel, challenging open-ended VQA dataset of visual USMLE-style problems. Furthermore, we conduct the first human evaluation for generative medical VQA, in which physicians review the problems and blinded generations in an interactive app. Med-Flamingo improves performance in generative medical VQA by up to 20% in clinicians' ratings and is the first model to enable multimodal medical few-shot adaptations such as rationale generation. We release our model, code, and evaluation app at https://github.com/snap-stanford/med-flamingo.
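In the few-shot setting, the model is conditioned on a handful of solved image-question-answer examples placed directly in its input, with no weight updates. The following is a minimal sketch of assembling such an interleaved prompt. It assumes the OpenFlamingo convention of `<image>` and `<|endofchunk|>` special tokens. The instruction text, the `Question:`/`Answer:` template, the `VQAShot` class and the `build_few_shot_prompt` helper are hypothetical illustrations, not the released Med-Flamingo code. Image loading and model inference are omitted.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed OpenFlamingo special tokens: <image> marks where an image is
# attended to; <|endofchunk|> closes one image-text chunk.
IMAGE_TOKEN = "<image>"
END_OF_CHUNK = "<|endofchunk|>"
# Hypothetical instruction prefix; not taken from the Med-Flamingo paper.
INSTRUCTION = "You are a helpful medical assistant."


@dataclass
class VQAShot:
    """One solved demonstration: an image file plus its question and answer."""
    image_path: str
    question: str
    answer: str


def build_few_shot_prompt(shots: List[VQAShot], query_image: str,
                          query_question: str) -> Tuple[str, List[str]]:
    """Interleave demonstrations and the query into one Flamingo-style prompt.

    Returns the prompt text and the image paths in the order their <image>
    tokens appear, which is the order the images must be batched in.
    """
    parts = [INSTRUCTION]
    images: List[str] = []
    for shot in shots:
        # Each demonstration is one image-text chunk with its answer filled in.
        parts.append(f"{IMAGE_TOKEN}Question: {shot.question} "
                     f"Answer: {shot.answer}{END_OF_CHUNK}")
        images.append(shot.image_path)
    # The query stops after "Answer:" so the model generates the answer.
    parts.append(f"{IMAGE_TOKEN}Question: {query_question} Answer:")
    images.append(query_image)
    return " ".join(parts), images


if __name__ == "__main__":
    # Hypothetical file names and questions, for illustration only.
    demo_shots = [
        VQAShot("ct_001.png", "Which organ is enlarged?", "The spleen."),
        VQAShot("mri_017.png", "In which plane is this image acquired?", "Axial."),
    ]
    prompt, image_paths = build_few_shot_prompt(
        demo_shots, "cxr_042.png", "Is there a pleural effusion?")
    print(prompt)
    print(image_paths)
```

The helper returns the image paths alongside the text so that the i-th image stays aligned with the i-th `<image>` token. Flamingo-style models use this alignment to match each visual input to its position in the text.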
