HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale

Abstract

The rapid development of multimodal large language models (MLLMs), such as GPT-4V, has led to significant advancements. However, these models still face challenges in medical multimodal capabilities because the quantity and quality of medical vision-text data are limited by data privacy concerns and high annotation costs. While pioneering approaches utilize PubMed's large-scale, de-identified medical image-text pairs to address these limitations, they still fall short due to inherent data noise. To tackle this, we refined medical image-text pairs from PubMed and employed MLLMs (GPT-4V) in an 'unblinded' capacity to denoise and reformat the data, resulting in the PubMedVision dataset with 1.3 million medical VQA samples. Our validation demonstrates that: (1) PubMedVision can significantly enhance the medical multimodal capabilities of current MLLMs, with marked improvements on benchmarks including the MMMU Health & Medicine track; (2) manual checks by medical experts and empirical results validate the superior data quality of our dataset compared to other data construction methods. Using PubMedVision, we train a 34B medical MLLM, HuatuoGPT-Vision, which shows superior performance in medical multimodal scenarios among open-source MLLMs.

