Multimodal ChatGPT for Medical Applications: an Experimental Study of GPT-4V

Abstract

In this paper, we critically evaluate the capabilities of the state-of-the-art multimodal large language model GPT-4 with Vision (GPT-4V) on the Visual Question Answering (VQA) task. Our experiments thoroughly assess GPT-4V's proficiency in answering questions paired with images, using pathology and radiology datasets that cover 11 modalities (e.g., microscopy, dermoscopy, X-ray, CT) and 15 objects of interest (e.g., brain, liver, lung). Our datasets encompass a comprehensive range of medical inquiries, including 16 distinct question types. Throughout our evaluations, we devised textual prompts for GPT-4V that direct it to combine visual and textual information. Based on accuracy scores, we conclude that the current version of GPT-4V is not recommended for real-world diagnostics because of its unreliable and suboptimal accuracy in responding to diagnostic medical questions. In addition, we delineate seven unique facets of GPT-4V's behavior in medical VQA, highlighting its constraints within this complex arena. The complete details of our evaluation cases are accessible at https://github.com/ZhilingYan/GPT4V-Medical-Report.
