LLaVA-Med：1日でトレーニングするバイオメディシン向け大規模言語・視覚アシスタント

要旨

会話型生成AIは、生物医学分野の実践者を支援する上で大きな可能性を示していますが、現在の研究は単一モダリティのテキストに焦点を当てています。マルチモーダル会話型AIは、一般ウェブから収集された数十億の画像-テキストペアを活用することで急速な進歩を遂げていますが、そのような汎用ドメインの視覚-言語モデルは、生物医学画像の理解と会話においてまだ洗練されていません。本論文では、生物医学画像に関するオープンエンドの研究質問に答えることができる視覚-言語会話アシスタントを、コスト効率よくトレーニングするアプローチを提案します。鍵となるアイデアは、PubMed Centralから抽出された大規模で広範な生物医学図表-キャプションデータセットを活用し、GPT-4を使用してキャプションからオープンエンドの指示追従データを自己生成し、新しいカリキュラム学習法を用いて大規模な汎用視覚-言語モデルを微調整することです。具体的には、モデルはまず図表-キャプションペアを使用して生物医学用語を整列させ、次にGPT-4が生成した指示追従データを使用してオープンエンドの会話的意味を習得します。これは、一般の人々が徐々に生物医学知識を習得するプロセスを模倣しています。これにより、8台のA100を使用して15時間未満で生物医学向け大規模言語・視覚アシスタント（LLaVA-Med）をトレーニングすることが可能です。LLaVA-Medは優れたマルチモーダル会話能力を示し、生物医学画像に関する問い合わせを支援するためにオープンエンドの指示に従うことができます。3つの標準的な生物医学視覚質問応答データセットにおいて、LLaVA-Medは特定の指標で従来の教師ありの最先端モデルを上回りました。生物医学マルチモーダル研究を促進するため、指示追従データとLLaVA-Medモデルを公開します。

English

Conversational generative AI has demonstrated remarkable promise for empowering biomedical practitioners, but current investigations focus on unimodal text. Multimodal conversational AI has seen rapid progress by leveraging billions of image-text pairs from the public web, but such general-domain vision-language models still lack sophistication in understanding and conversing about biomedical images. In this paper, we propose a cost-efficient approach for training a vision-language conversational assistant that can answer open-ended research questions of biomedical images. The key idea is to leverage a large-scale, broad-coverage biomedical figure-caption dataset extracted from PubMed Central, use GPT-4 to self-instruct open-ended instruction-following data from the captions, and then fine-tune a large general-domain vision-language model using a novel curriculum learning method. Specifically, the model first learns to align biomedical vocabulary using the figure-caption pairs as is, then learns to master open-ended conversational semantics using GPT-4 generated instruction-following data, broadly mimicking how a layperson gradually acquires biomedical knowledge. This enables us to train a Large Language and Vision Assistant for BioMedicine (LLaVA-Med) in less than 15 hours (with eight A100s). LLaVA-Med exhibits excellent multimodal conversational capability and can follow open-ended instruction to assist with inquiries about a biomedical image. On three standard biomedical visual question answering datasets, LLaVA-Med outperforms previous supervised state-of-the-art on certain metrics. To facilitate biomedical multimodal research, we will release our instruction-following data and the LLaVA-Med model.

LLaVA-Med：1日でトレーニングするバイオメディシン向け大規模言語・視覚アシスタント

LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day

要旨

Support