
LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day

June 1, 2023
Authors: Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, Jianfeng Gao
cs.AI

Abstract

Conversational generative AI has demonstrated remarkable promise for empowering biomedical practitioners, but current investigations focus on unimodal text. Multimodal conversational AI has seen rapid progress by leveraging billions of image-text pairs from the public web, but such general-domain vision-language models still lack sophistication in understanding and conversing about biomedical images. In this paper, we propose a cost-efficient approach for training a vision-language conversational assistant that can answer open-ended research questions of biomedical images. The key idea is to leverage a large-scale, broad-coverage biomedical figure-caption dataset extracted from PubMed Central, use GPT-4 to self-instruct open-ended instruction-following data from the captions, and then fine-tune a large general-domain vision-language model using a novel curriculum learning method. Specifically, the model first learns to align biomedical vocabulary using the figure-caption pairs as is, then learns to master open-ended conversational semantics using GPT-4 generated instruction-following data, broadly mimicking how a layperson gradually acquires biomedical knowledge. This enables us to train a Large Language and Vision Assistant for BioMedicine (LLaVA-Med) in less than 15 hours (with eight A100s). LLaVA-Med exhibits excellent multimodal conversational capability and can follow open-ended instruction to assist with inquiries about a biomedical image. On three standard biomedical visual question answering datasets, LLaVA-Med outperforms previous supervised state-of-the-art on certain metrics. To facilitate biomedical multimodal research, we will release our instruction-following data and the LLaVA-Med model.
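The two-stage curriculum described above can be pictured with a short sketch: stage 1 updates only the vision-to-language projection on figure-caption pairs to align biomedical concepts, and stage 2 additionally tunes the language model on the GPT-4 generated instruction-following conversations. The class and function names (BiomedVLM, train_stage), dataset handles (pmc_caption_pairs, gpt4_instruct_data), and hyperparameters below are hypothetical placeholders for illustration, not the authors' released implementation.

```python
# Minimal sketch of the two-stage curriculum, assuming placeholder components.
import torch
import torch.nn as nn

class BiomedVLM(nn.Module):
    """General-domain VLM: vision encoder + linear projection + language model."""
    def __init__(self, vision_encoder: nn.Module, language_model: nn.Module,
                 vis_dim: int = 1024, txt_dim: int = 4096):
        super().__init__()
        self.vision_encoder = vision_encoder            # e.g. a CLIP-style image tower
        self.projection = nn.Linear(vis_dim, txt_dim)   # maps image features into the LM embedding space
        self.language_model = language_model

    def forward(self, image, text_tokens):
        vis_feats = self.vision_encoder(image)          # image features
        vis_tokens = self.projection(vis_feats)         # projected visual tokens
        return self.language_model(vis_tokens, text_tokens)  # logits for next-token prediction

def train_stage(model, dataloader, trainable_params, lr, epochs):
    """One curriculum stage: standard next-token training over the given parameters."""
    optimizer = torch.optim.AdamW(trainable_params, lr=lr)
    for _ in range(epochs):
        for image, text_tokens, labels in dataloader:
            logits = model(image, text_tokens)
            loss = nn.functional.cross_entropy(logits, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

# Stage 1: biomedical concept alignment on PubMed Central figure-caption pairs,
# updating only the projection layer (hypothetical dataloader and learning rate).
# train_stage(model, pmc_caption_pairs, model.projection.parameters(), lr=1e-3, epochs=1)

# Stage 2: instruction tuning on GPT-4 generated conversations,
# updating both the projection layer and the language model.
# train_stage(model, gpt4_instruct_data,
#             list(model.projection.parameters()) + list(model.language_model.parameters()),
#             lr=2e-5, epochs=3)
```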