LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day

June 1, 2023
Authors: Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, Jianfeng Gao
cs.AI

Abstract

Conversational generative AI has demonstrated remarkable promise for empowering biomedical practitioners, but current investigations focus on unimodal text. Multimodal conversational AI has seen rapid progress by leveraging billions of image-text pairs from the public web, but such general-domain vision-language models still lack sophistication in understanding and conversing about biomedical images. In this paper, we propose a cost-efficient approach for training a vision-language conversational assistant that can answer open-ended research questions of biomedical images. The key idea is to leverage a large-scale, broad-coverage biomedical figure-caption dataset extracted from PubMed Central, use GPT-4 to self-instruct open-ended instruction-following data from the captions, and then fine-tune a large general-domain vision-language model using a novel curriculum learning method. Specifically, the model first learns to align biomedical vocabulary using the figure-caption pairs as is, then learns to master open-ended conversational semantics using GPT-4 generated instruction-following data, broadly mimicking how a layperson gradually acquires biomedical knowledge. This enables us to train a Large Language and Vision Assistant for BioMedicine (LLaVA-Med) in less than 15 hours (with eight A100s). LLaVA-Med exhibits excellent multimodal conversational capability and can follow open-ended instruction to assist with inquiries about a biomedical image. On three standard biomedical visual question answering datasets, LLaVA-Med outperforms previous supervised state-of-the-art on certain metrics. To facilitate biomedical multimodal research, we will release our instruction-following data and the LLaVA-Med model.
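The data-generation step described above can be sketched in a few lines: prompt GPT-4 with a PubMed Central figure caption (text only) and ask it to produce a multi-turn conversation about the figure. The prompt wording, model name, file names, and JSON layout below are illustrative assumptions rather than the authors' released pipeline; the sketch only shows the general shape of self-instructing instruction-following data from captions.

```python
# Minimal sketch, assuming the official openai Python package (v1+ client) and
# an OPENAI_API_KEY in the environment. Prompt text, file paths, and the JSONL
# schema are hypothetical, chosen only to illustrate the caption-to-dialogue step.
import json

from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You are an AI assistant specialized in biomedical topics. "
    "You are given the caption of a figure from a PubMed Central article. "
    "Generate a multi-turn conversation between a person asking about the figure "
    "and you answering, as if you could see the image. Only include questions "
    "that can be answered confidently from the caption."
)


def caption_to_conversation(caption: str, model: str = "gpt-4") -> str:
    """Turn one figure caption into synthetic instruction-following dialogue."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Figure caption:\n{caption}"},
        ],
        temperature=0.7,
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    # Hypothetical input: one JSON object per line with "image" and "caption" fields.
    with open("pmc_figure_captions.jsonl") as f_in, open(
        "llava_med_instruct.jsonl", "w"
    ) as f_out:
        for line in f_in:
            record = json.loads(line)
            record["conversation"] = caption_to_conversation(record["caption"])
            f_out.write(json.dumps(record) + "\n")
```

The resulting conversations would then feed the second curriculum stage (instruction tuning), after an initial stage that aligns biomedical vocabulary on the raw figure-caption pairs.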