LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day

Abstract

Conversational generative AI has demonstrated remarkable promise for empowering biomedical practitioners, but current investigations focus on unimodal text. Multimodal conversational AI has seen rapid progress by leveraging billions of image-text pairs from the public web, but such general-domain vision-language models still lack sophistication in understanding and conversing about biomedical images. In this paper, we propose a cost-efficient approach for training a vision-language conversational assistant that can answer open-ended research questions about biomedical images. The key idea is to leverage a large-scale, broad-coverage biomedical figure-caption dataset extracted from PubMed Central, use GPT-4 to self-instruct open-ended instruction-following data from the captions, and then fine-tune a large general-domain vision-language model using a novel curriculum learning method. Specifically, the model first learns to align biomedical vocabulary using the figure-caption pairs as is, then learns to master open-ended conversational semantics using GPT-4-generated instruction-following data, broadly mimicking how a layperson gradually acquires biomedical knowledge. This enables us to train a Large Language and Vision Assistant for BioMedicine (LLaVA-Med) in less than 15 hours (with eight A100s). LLaVA-Med exhibits excellent multimodal conversational capability and can follow open-ended instructions to assist with inquiries about a biomedical image. On three standard biomedical visual question answering datasets, LLaVA-Med outperforms the previous supervised state of the art on certain metrics. To facilitate biomedical multimodal research, we will release our instruction-following data and the LLaVA-Med model.
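The two-stage curriculum described in the abstract can be illustrated with a minimal Python sketch. It shows only the ordering of the training data: figure-caption pairs first, GPT-4-generated instruction data second. Everything named here (`Sample`, `caption_to_sample`, `curriculum`, the stage labels, and the generic "Describe the image." prompt) is a hypothetical choice made for illustration, not the authors' code, and the actual model updates are left out.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple


@dataclass
class Sample:
    image_id: str  # hypothetical identifier for a figure
    prompt: str    # instruction shown to the model
    target: str    # text the model is trained to produce


def caption_to_sample(image_id: str, caption: str) -> Sample:
    """Stage 1: use a figure-caption pair as is.

    The generic prompt below is an assumption made for this sketch.
    """
    return Sample(image_id, "Describe the image.", caption)


def curriculum(
    caption_pairs: List[Tuple[str, str]],
    instruction_data: List[Sample],
) -> Iterator[Tuple[str, Sample]]:
    """Yield (stage, sample) pairs: all caption samples, then all instruction samples."""
    for image_id, caption in caption_pairs:
        yield "align", caption_to_sample(image_id, caption)
    for sample in instruction_data:
        yield "instruct", sample


if __name__ == "__main__":
    # Toy placeholder data, not taken from any real dataset.
    pairs = [("fig_a", "caption a"), ("fig_b", "caption b")]
    instructions = [Sample("fig_a", "What is shown?", "An answer.")]
    for stage, sample in curriculum(pairs, instructions):
        print(stage, sample.image_id, sample.prompt)
```

In a real training run, each yielded sample would feed an optimizer step; which parameters are updated in each stage is not stated in the abstract, so the sketch leaves that open.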
