HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale

June 27, 2024
Authors: Junying Chen, Ruyi Ouyang, Anningzhe Gao, Shunian Chen, Guiming Hardy Chen, Xidong Wang, Ruifei Zhang, Zhenyang Cai, Ke Ji, Guangjun Yu, Xiang Wan, Benyou Wang
cs.AI

Abstract

The rapid development of multimodal large language models (MLLMs), such as GPT-4V, has led to significant advancements. However, these models still face challenges in medical multimodal capabilities due to limitations in the quantity and quality of medical vision-text data, stemming from data privacy concerns and high annotation costs. While pioneering approaches utilize PubMed's large-scale, de-identified medical image-text pairs to address these limitations, they still fall short due to inherent data noise. To tackle this, we refined medical image-text pairs from PubMed and employed MLLMs (GPT-4V) in an 'unblinded' capacity to denoise and reformat the data, resulting in the creation of the PubMedVision dataset with 1.3 million medical VQA samples. Our validation demonstrates that: (1) PubMedVision can significantly enhance the medical multimodal capabilities of current MLLMs, showing significant improvement in benchmarks including the MMMU Health & Medicine track; (2) manual checks by medical experts and empirical results validate the superior data quality of our dataset compared to other data construction methods. Using PubMedVision, we train a 34B medical MLLM HuatuoGPT-Vision, which shows superior performance in medical multimodal scenarios among open-source MLLMs.
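The abstract describes the data pipeline only at a high level: a vision-capable model (GPT-4V) is used in an "unblinded" way, meaning it sees the image together with its caption, so it can discard noisy pairs and rewrite the rest as VQA samples. The snippet below is a minimal sketch of what such a denoising-and-reformatting step could look like; the prompt wording, the `ImageTextPair` and `reformat_pair` names, and the model identifier are illustrative assumptions, not the authors' released pipeline.

```python
# Minimal sketch: rewrite one PubMed image-caption pair as a VQA sample
# using a vision-capable chat model. Names and prompt text are assumptions.
import base64
import json
from dataclasses import dataclass

from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


@dataclass
class ImageTextPair:
    image_path: str  # local path to a figure extracted from a PubMed article
    caption: str     # the (possibly noisy) caption or surrounding text


REFORMAT_PROMPT = (
    "You are given a medical figure and its caption. "
    "If the caption is irrelevant or uninformative, reply with the single word SKIP. "
    "Otherwise, write one clinically meaningful question about the image and a "
    "faithful answer grounded in the caption. Return JSON with keys "
    "'question' and 'answer'."
)


def reformat_pair(pair: ImageTextPair, model: str = "gpt-4o") -> dict | None:
    """Ask a vision-capable model to denoise and rewrite one pair as a VQA sample."""
    with open(pair.image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": f"{REFORMAT_PROMPT}\n\nCaption: {pair.caption}"},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
                ],
            }
        ],
    )
    text = response.choices[0].message.content.strip()
    if text == "SKIP":
        return None  # pair judged too noisy to keep
    try:
        return json.loads(text)  # {"question": ..., "answer": ...}
    except json.JSONDecodeError:
        return None  # discard malformed outputs rather than polluting the dataset
```

Run over the full corpus of refined PubMed pairs, a filter-and-rewrite loop of this shape would yield the kind of large-scale VQA data the abstract describes, with the noisiest pairs dropped rather than passed through.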
