HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale

Abstract

Multimodal large language models (MLLMs) such as GPT-4V have advanced rapidly. However, their medical multimodal capabilities remain limited by the quantity and quality of medical vision-text data, a consequence of data privacy concerns and high annotation costs. Pioneering approaches address these limitations with PubMed's large-scale, de-identified medical image-text pairs, but they still fall short because of noise inherent in the data. To tackle this, we refined medical image-text pairs from PubMed and employed MLLMs (GPT-4V) in an 'unblinded' capacity to denoise and reformat the data, producing the PubMedVision dataset with 1.3 million medical VQA samples. Our validation demonstrates that: (1) PubMedVision can significantly enhance the medical multimodal capabilities of current MLLMs, with marked improvements on benchmarks including the MMMU Health & Medicine track; (2) manual checks by medical experts and empirical results confirm the superior data quality of our dataset compared to other data construction methods. Using PubMedVision, we trained a 34B medical MLLM, HuatuoGPT-Vision, which shows superior performance in medical multimodal scenarios among open-source MLLMs.
